[Maria-developers] Updated (by Knielsen): Efficient group commit for binary log (116)

25 May 2010

-----------------------------------------------------------------------
                              WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Efficient group commit for binary log
CREATION DATE..: Mon, 26 Apr 2010, 13:28
SUPERVISOR.....: Knielsen
IMPLEMENTOR....: 
COPIES TO......: Serg
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 116 (http://askmonty.org/worklog/?tid=116)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 49
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0

PROGRESS NOTES:

-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High-Level Specification modified.

--- /tmp/wklog.116.old.14249    2010-05-25 13:18:34.000000000 +0000
+++ /tmp/wklog.116.new.14249    2010-05-25 13:18:34.000000000 +0000
@@ -1 +1,157 @@
+The basic idea in group commit is that multiple threads, each handling one
+transaction, prepare for commit and then queue up together waiting to do an
+fsync() on the transaction log. Then once the log is available, a single
+thread does the fsync() + other necessary book-keeping for all of the threads
+at once. After this, the single thread signals the other threads that it's
+done and they can finish up and return success (or failure) from the commit
+operation.
+
+So group commit has a parallel part, and a sequential part. So we need a
+facility for engines/binlog to participate in both the parallel and the
+sequential part.
+
+To do this, we add two new handlerton methods:
+
+    int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
+    void (*commit_ordered)(handlerton *hton, THD *thd, bool all);
+
+The idea is that the existing prepare() and commit() methods run in the
+parallel part of group commit, and the new prepare_ordered() and
+commit_ordered() run in the sequential part.
+
+The prepare_ordered() method is called after prepare(). The order of
+tranctions that call into prepare_ordered() is guaranteed to be the same among
+all storage engines and binlog, and it is serialised so no two calls can be
+running inside the same engine at the same time.
+
+The commit_ordered() method is called before commit(), and similarly is
+guaranteed to have same transaction order in all participants, and to be
+serialised within one engine.
+
+As the prepare_ordered() and commit_ordered() calls are serialised, the idea
+is that handlers should do the minimum amount of work needed in these calls,
+relaying most of the work (eg. fsync() ...) to prepare() and commit().
+
+As a concrete example, for InnoDB the commit_ordered() method will do the
+first part of commit that fixed the commit order in the transaction log
+buffer, and the commit() method will write the log to disk and fsync()
+it. This split already exists inside the InnoDB code, running before
+respectively after releasing the prepare_commit_mutex.
+
+In addition, the XA transaction coordinator (TC_LOG) is special, since it is
+the one responsible for deciding whether to commit or rollback the
+transaction. For this we need an extra method, since this decision can be done
+only after we know that all prepare() and prepare_ordered() calls succeed, and
+must be done to know whether to call commit_ordered()/commit(), or do rollback.
+
+The existing method for this is TC_LOG::log_xid(). To make implementing group
+commit simpler to implement in a transaction coordinator and more efficient,
+we introduce a new method:
+
+    void group_log_xid(THD *first_thd);
+
+This method runs in the sequential part of group commit. It receives a list of
+transactions to perform log_xid() on, in the correct commit order. (Note that
+TC_LOG can do parallel parts of group commit in its own prepare() and commit()
+methods).
+
+This method can make it easier to implement the group commit in TC_LOG, as it
+gets directly the list of transactions in the right order. Without it, it
+might need to compute such order anyway in a prepare_ordered() method, and the
+server has to create this ordered list anyway to implement the order guarantee
+for prepare_ordered() and commit_ordered().
+
+This group_log_xid() method also is more efficient, as it avoids some
+inter-thread synchronisation. Since group_log_xid() is serialised, we can run
+it together with all the commit_ordered() method calls and need only a single
+sequential code section. With the log_xid() methods, we would need first a
+sequential part for the prepare_ordered() calls, then a parallel part with
+log_xid() calls (to not loose group commit ability for log_xid()), then again
+a sequential part for the commit_ordered() method calls.
+
+The extra synchronisation is needed, as each commit_ordered() call will have
+to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
+should not be called), and also wait for commit_ordered() to finish in all
+threads handling earlier commits. In effect we will need to bounce the
+execution from one thread to the other among all participants in the group
+commit.
+
+As a consequence of the group_log_xid() optimisation, handlers must be aware
+that the commit_ordered() call can happen in another thread than the one
+running commit() (so thread local storage is not available). This should not
+be a big issue as the THD is available for storing any needed information.
+
+Since group_log_xid() runs for multiple transactions in a single thread, it
+can not do error reporting (my_error()) as that relies on thread local
+storage. Instead it sets an error code in THD::xid_error, and if there is an
+error then later another method will be called (in correct thread context) to
+actually report the error:
+
+    int xid_delayed_error(THD *thd)
+
+The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
+are optional (as is xid_delayed_error). A storage engine or transaction
+coordinator is free to not implement them if they are not needed. In this case
+there will be no order guarantee for the corresponding stage of group commit
+for that engine. For example, InnoDB needs no ordering of the prepare phase,
+so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
+all, so does not need to implement any of them.
+
+Note in particular that all existing engines (/binlog implementations if they
+exist) will work unmodified (and also without any change in group commit
+facilities or commit order guaranteed).
+
+Using these new APIs, the work will be to
+
+ - In ha_commit_trans(), implement the correct semantics for the three new
+   calls.
+
+ - In XtraDB, use the new commit_ordered() call to remove the
+   prepare_commit_mutex (and resurrect group commit) without loosing the
+   consistency with binlog commit order.
+
+ - In log.cc (binlog module), implement group_log_xid() to do group commit of
+   multiple transactions to the binlog with a single shared fsync() call.
+
+-----------------------------------------------------------------------
+Some possible alternative for this worklog:
+
+ - We could eliminate the group_log_xid() method for a simpler API, at the
+   cost of extra synchronisation between threads to do in-order
+   commit_ordered() method calls. This would also allow to call
+   commit_ordered() in the correct thread context.
+
+ - Alternatively, we could eliminate log_xid() and require that all
+   transaction coordinators implement group_log_xid() instead, again for some
+   moderate simplification.
+
+ - At the moment there is no plugin actually using prepare_ordered(), so, it
+   could be removed from the design. But it fits in well, is efficient to
+   implement, and could be useful later (eg. for the requested feature of
+   releasing locks early in InnoDB).
+
+-----------------------------------------------------------------------
+Some possible follow-up projects after this is implemented:
+
+ - Add statistics about how efficient group commit is (#fsyncs/#commits in
+   each engine and binlog).
+
+ - Implement an XtraDB prepare_ordered() methods that can release row locks
+   early (Mark Callaghan from Facebook advocates this, but need to determine
+   exactly how to do this safely).
+
+ - Implement a new crash recovery algorithm that uses the consistent commit
+   ordering to need only fsync() for the binlog. At crash recovery, any
+   missing transactions in an engine is replayed from the correct point in the
+   binlog (this point must be stored transactionally inside the engine, as
+   XtraDB already does today).
+
+ - Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
+   consistent snapshow, with same set of committed and not committed
+   transactions in all engines, 2) returns a corresponding consistent binlog
+   position. This should be easy by piggybacking on the synchronisation
+   implemented for ha_commit_trans().
+
+ - Use this in XtraBackup to get consistent binlog position without having to
+   block all updates with FLUSH TABLES WITH READ LOCK.
 

-=-=(Knielsen - Tue, 25 May 2010, 13:18)=-=-
High Level Description modified.
--- /tmp/wklog.116.old.14234    2010-05-25 13:18:07.000000000 +0000
+++ /tmp/wklog.116.new.14234    2010-05-25 13:18:07.000000000 +0000
@@ -21,3 +21,69 @@
     http://kristiannielsen.livejournal.com/12408.html
     http://kristiannielsen.livejournal.com/12553.html
 
+----
+
+Implementing group commit in MySQL faces some challenges from the handler
+plugin architecture:
+
+1. Because storage engine handlers have separate transaction log from the
+mysql binlog (and from each other), there are multiple fsync() calls per
+commit that need the group commit optimisation (2 per participating storage
+engine + 1 for binlog).
+
+2. The code handling commit is split in several places, in main server code
+and in storage engine code. With pluggable binlog it will be split even
+more. This requires a good abstract yet powerful API to be able to implement
+group commit simply and efficiently in plugins without the different parts
+having to rely on iternals of the others.
+
+3. We want the order of commits to be the same in all engines participating in
+multiple transactions. This requirement is the reason that InnoDB currently
+breaks group commit with the infamous prepare_commit_mutex.
+
+While currently there is no server guarantee to get same commit order in
+engines an binlog (except for the InnoDB prepare_commit_mutex hack), there are
+several reasons why this could be desirable:
+
+ - InnoDB hot backup needs to be able to extract a binlog position that is
+   consistent with the hot backup to be able to provision a new slave, and
+   this is impossible without imposing at least partial consistent ordering
+   between InnoDB and binlog.
+
+ - Other backup methods could have similar needs, eg. XtraBackup or
+   `mysqldump --single-transaction`, to have consistent commit order between
+   binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
+   or similar expensive blocking operation. (other backup methods, like LVM
+   snapshot, don't need consistent commit order, as they can restore
+   out-of-order commits during crash recovery using XA).
+
+ - If we have consistent commit order, we can think about optimising commit to
+   need only one fsync (for binlog); lost commits in storage engines can then
+   be recovered from the binlog at crash recovery by re-playing against the
+   engine from a particular point in the binlog.
+
+ - With consistent commit order, we can get better semantics for START
+   TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
+   could even get it to return also a matching binlog position). Currently,
+   this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
+   engines.
+
+ - In InnoDB, the performance in the presense of hotspots can be improved if
+   we can release row locks early in the commit phase, but this requires that we
+release them in
+   the same order as commits in the binlog to ensure consistency between
+   master and slaves.
+
+ - There was some discussions around Galera [1] synchroneous replication and
+   global transaction ID that it needed consistent commit order among
+   participating engines.
+
+ - I believe there could be other applications for guaranteed consistent
+   commit order, and that the architecture described in this worklog can
+   implement such guarantee with reasonable overhead.
+
+
+References:
+
+[1] Galera: http://www.codership.com/products/galera_replication
+

-=-=(Knielsen - Tue, 25 May 2010, 08:28)=-=-
More thoughts on and changes to the archtecture. Got to something now that I am satisfied with and
that seems to be able to handle all issues.

Implement new prepare_ordered and commit_ordered handler methods and the logic in ha_commit_trans().
Implement TC_LOG::group_log_xid() method and logic in ha_commit_trans().
Implement XtraDB part, using commit_ordered() rather than prepare_commit_mutex.
Fix test suite failures.
Proof-of-concept patch series complete now.
Do initial benchmark, getting good results. With 64 threads, see 26x improvement in queries-per-sec.

Next step: write up the architecture description.

Worked 21 hours and estimate 0 hours remain (original estimate increased by 21 hours).

-=-=(Knielsen - Wed, 12 May 2010, 06:41)=-=-
Started work on a Quilt patch series, refactoring the binlog code to prepare for implementing the
group commit, and working on the design of group commit in parallel.

Found and fixed several problems in error handling when writing to binlog.

Removed redundant table map version locking.

Split binlog writing into two parts in preparations for group commit. When ready to write to the
binlog, threads enter a queue, and the first thread in the queue handles the binlog writing for
everyone. When it obtains the LOCK_log, it first loops over all threads, executing the first part of
binlog writing (the write(2) syscall essentially). It then runs the second part (fsync(2)
essentially) only once, and then wakes up the remaining threads in the queue.

Still to be done:

Finish the proof-of-concept group commit patch, by 1) implementing the prepare_fast() and
commit_fast() callbacks in handler.cc 2) move the binlog thread enqueue from log_xid() to
binlog_prepare_fast(), 3) move fast part of InnoDB commit to innobase_commit_fast(), removing the
prepare_commit_mutex().

Write up the final design in this worklog.

Evaluate the design to see if we can do better/different.

Think about possible next steps, such as releasing innodb row locks early (in
innobase_prepare_fast), and doing crash recovery by replaying transactions from the binlog (removing
the need for engine durability and 2 of 3 fsync() in commit).
Worked 28 hours and estimate 0 hours remain (original estimate increased by 28 hours).

-=-=(Serg - Mon, 26 Apr 2010, 14:10)=-=-
Observers changed: Serg



DESCRIPTION:

Currently, in order to ensure that the server can recover after a crash to a
state in which storage engines and binary log are consistent with each other,
it is necessary to use XA with durable commits for both storage engines
(innodb_flush_log_at_trx_commit=1) and binary log (sync_binlog=1).

This is _very_ expensive, since the server needs to do three fsync() operations
for every commit, as there is no working group commit when the binary log is
enabled.

The idea is to

 - Implement/fix group commit to work properly with the binary log enabled.

 - (Optionally) avoid the need to fsync() in the engine, and instead rely on
   replaying any lost transactions from the binary log against the engine
   during crash recovery.

For background see these articles:

    http://kristiannielsen.livejournal.com/12254.html
    http://kristiannielsen.livejournal.com/12408.html
    http://kristiannielsen.livejournal.com/12553.html

----

Implementing group commit in MySQL faces some challenges from the handler
plugin architecture:

1. Because storage engine handlers have separate transaction log from the
mysql binlog (and from each other), there are multiple fsync() calls per
commit that need the group commit optimisation (2 per participating storage
engine + 1 for binlog).

2. The code handling commit is split in several places, in main server code
and in storage engine code. With pluggable binlog it will be split even
more. This requires a good abstract yet powerful API to be able to implement
group commit simply and efficiently in plugins without the different parts
having to rely on iternals of the others.

3. We want the order of commits to be the same in all engines participating in
multiple transactions. This requirement is the reason that InnoDB currently
breaks group commit with the infamous prepare_commit_mutex.

While currently there is no server guarantee to get same commit order in
engines an binlog (except for the InnoDB prepare_commit_mutex hack), there are
several reasons why this could be desirable:

 - InnoDB hot backup needs to be able to extract a binlog position that is
   consistent with the hot backup to be able to provision a new slave, and
   this is impossible without imposing at least partial consistent ordering
   between InnoDB and binlog.

 - Other backup methods could have similar needs, eg. XtraBackup or
   `mysqldump --single-transaction`, to have consistent commit order between
   binlog and storage engines without having to do FLUSH TABLES WITH READ LOCK
   or similar expensive blocking operation. (other backup methods, like LVM
   snapshot, don't need consistent commit order, as they can restore
   out-of-order commits during crash recovery using XA).

 - If we have consistent commit order, we can think about optimising commit to
   need only one fsync (for binlog); lost commits in storage engines can then
   be recovered from the binlog at crash recovery by re-playing against the
   engine from a particular point in the binlog.

 - With consistent commit order, we can get better semantics for START
   TRANSACTION WITH CONSISTENT SNAPSHOT with multi-engine transactions (and we
   could even get it to return also a matching binlog position). Currently,
   this "CONSISTENT SNAPSHOT" can be inconsistent among multiple storage
   engines.

 - In InnoDB, the performance in the presense of hotspots can be improved if
   we can release row locks early in the commit phase, but this requires that we
release them in
   the same order as commits in the binlog to ensure consistency between
   master and slaves.

 - There was some discussions around Galera [1] synchroneous replication and
   global transaction ID that it needed consistent commit order among
   participating engines.

 - I believe there could be other applications for guaranteed consistent
   commit order, and that the architecture described in this worklog can
   implement such guarantee with reasonable overhead.


References:

[1] Galera: http://www.codership.com/products/galera_replication


HIGH-LEVEL SPECIFICATION:



The basic idea in group commit is that multiple threads, each handling one
transaction, prepare for commit and then queue up together waiting to do an
fsync() on the transaction log. Then once the log is available, a single
thread does the fsync() + other necessary book-keeping for all of the threads
at once. After this, the single thread signals the other threads that it's
done and they can finish up and return success (or failure) from the commit
operation.

So group commit has a parallel part, and a sequential part. So we need a
facility for engines/binlog to participate in both the parallel and the
sequential part.

To do this, we add two new handlerton methods:

    int (*prepare_ordered)(handlerton *hton, THD *thd, bool all);
    void (*commit_ordered)(handlerton *hton, THD *thd, bool all);

The idea is that the existing prepare() and commit() methods run in the
parallel part of group commit, and the new prepare_ordered() and
commit_ordered() run in the sequential part.

The prepare_ordered() method is called after prepare(). The order of
tranctions that call into prepare_ordered() is guaranteed to be the same among
all storage engines and binlog, and it is serialised so no two calls can be
running inside the same engine at the same time.

The commit_ordered() method is called before commit(), and similarly is
guaranteed to have same transaction order in all participants, and to be
serialised within one engine.

As the prepare_ordered() and commit_ordered() calls are serialised, the idea
is that handlers should do the minimum amount of work needed in these calls,
relaying most of the work (eg. fsync() ...) to prepare() and commit().

As a concrete example, for InnoDB the commit_ordered() method will do the
first part of commit that fixed the commit order in the transaction log
buffer, and the commit() method will write the log to disk and fsync()
it. This split already exists inside the InnoDB code, running before
respectively after releasing the prepare_commit_mutex.

In addition, the XA transaction coordinator (TC_LOG) is special, since it is
the one responsible for deciding whether to commit or rollback the
transaction. For this we need an extra method, since this decision can be done
only after we know that all prepare() and prepare_ordered() calls succeed, and
must be done to know whether to call commit_ordered()/commit(), or do rollback.

The existing method for this is TC_LOG::log_xid(). To make implementing group
commit simpler to implement in a transaction coordinator and more efficient,
we introduce a new method:

    void group_log_xid(THD *first_thd);

This method runs in the sequential part of group commit. It receives a list of
transactions to perform log_xid() on, in the correct commit order. (Note that
TC_LOG can do parallel parts of group commit in its own prepare() and commit()
methods).

This method can make it easier to implement the group commit in TC_LOG, as it
gets directly the list of transactions in the right order. Without it, it
might need to compute such order anyway in a prepare_ordered() method, and the
server has to create this ordered list anyway to implement the order guarantee
for prepare_ordered() and commit_ordered().

This group_log_xid() method also is more efficient, as it avoids some
inter-thread synchronisation. Since group_log_xid() is serialised, we can run
it together with all the commit_ordered() method calls and need only a single
sequential code section. With the log_xid() methods, we would need first a
sequential part for the prepare_ordered() calls, then a parallel part with
log_xid() calls (to not loose group commit ability for log_xid()), then again
a sequential part for the commit_ordered() method calls.

The extra synchronisation is needed, as each commit_ordered() call will have
to wait for log_xid() in one thread (if log_xid() fails then commit_ordered()
should not be called), and also wait for commit_ordered() to finish in all
threads handling earlier commits. In effect we will need to bounce the
execution from one thread to the other among all participants in the group
commit.

As a consequence of the group_log_xid() optimisation, handlers must be aware
that the commit_ordered() call can happen in another thread than the one
running commit() (so thread local storage is not available). This should not
be a big issue as the THD is available for storing any needed information.

Since group_log_xid() runs for multiple transactions in a single thread, it
can not do error reporting (my_error()) as that relies on thread local
storage. Instead it sets an error code in THD::xid_error, and if there is an
error then later another method will be called (in correct thread context) to
actually report the error:

    int xid_delayed_error(THD *thd)

The three new methods prepare_ordered(), group_log_xid(), and commit_ordered()
are optional (as is xid_delayed_error). A storage engine or transaction
coordinator is free to not implement them if they are not needed. In this case
there will be no order guarantee for the corresponding stage of group commit
for that engine. For example, InnoDB needs no ordering of the prepare phase,
so can omit implementing prepare_ordered(); TC_LOG_MMAP needs no ordering at
all, so does not need to implement any of them.

Note in particular that all existing engines (/binlog implementations if they
exist) will work unmodified (and also without any change in group commit
facilities or commit order guaranteed).

Using these new APIs, the work will be to

 - In ha_commit_trans(), implement the correct semantics for the three new
   calls.

 - In XtraDB, use the new commit_ordered() call to remove the
   prepare_commit_mutex (and resurrect group commit) without loosing the
   consistency with binlog commit order.

 - In log.cc (binlog module), implement group_log_xid() to do group commit of
   multiple transactions to the binlog with a single shared fsync() call.

-----------------------------------------------------------------------
Some possible alternative for this worklog:

 - We could eliminate the group_log_xid() method for a simpler API, at the
   cost of extra synchronisation between threads to do in-order
   commit_ordered() method calls. This would also allow to call
   commit_ordered() in the correct thread context.

 - Alternatively, we could eliminate log_xid() and require that all
   transaction coordinators implement group_log_xid() instead, again for some
   moderate simplification.

 - At the moment there is no plugin actually using prepare_ordered(), so, it
   could be removed from the design. But it fits in well, is efficient to
   implement, and could be useful later (eg. for the requested feature of
   releasing locks early in InnoDB).

-----------------------------------------------------------------------
Some possible follow-up projects after this is implemented:

 - Add statistics about how efficient group commit is (#fsyncs/#commits in
   each engine and binlog).

 - Implement an XtraDB prepare_ordered() methods that can release row locks
   early (Mark Callaghan from Facebook advocates this, but need to determine
   exactly how to do this safely).

 - Implement a new crash recovery algorithm that uses the consistent commit
   ordering to need only fsync() for the binlog. At crash recovery, any
   missing transactions in an engine is replayed from the correct point in the
   binlog (this point must be stored transactionally inside the engine, as
   XtraDB already does today).

 - Implement that START TRANSACTION WITH CONSISTENT SNAPSHOT 1) really gets a
   consistent snapshow, with same set of committed and not committed
   transactions in all engines, 2) returns a corresponding consistent binlog
   position. This should be easy by piggybacking on the synchronisation
   implemented for ha_commit_trans().

 - Use this in XtraBackup to get consistent binlog position without having to
   block all updates with FLUSH TABLES WITH READ LOCK.


ESTIMATED WORK TIME

ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v3.5.9)

    

worklog-noreply＠askmonty.org

tags

participants (1)