[Maria-developers] Parallel replication benchmarks

6 Mar 2014

I've done a set of benchmarks for parallel replication on the same machine I
used previously for my group commit benchmarks:
http://kristiannielsen.livejournal.com/16382.html

The code tested is the newest code in the bzr repository and what will be in
10.0.9 (this is significantly improved from what is in 10.0.8).

I plan to write up a blog post about it in a couple of days with nice graphs,
but meanwhile Axel asked me to summarise in this mail.

I tested with sysbench 0.5, using oltp.lua (medium-sized transactions) and
update_index.lua (minimal transactions with just a single primary-key update
per transaction). I used 10M rows, 16GB buffer pool and 2 * 1.9 GB redo logs.
This is with a single table.

I tested simply by preparing the binlog on the master, then setting up an
already prepared slave and doing START SLAVE UNTIL the end of the log. The
error log then shows the time spent for the slave to catch up. I tested
everything in GTID mode, as that is the recommended mode for parallel
replication (though my guess is that old-style replication will perform much
the same, since the code that actually executes events differs little between
the two modes).

Note that these tests are for in-order parallel replication. All commits on
all slaves happen in the same order as on the master; the use of parallel
replication is invisible to applications. This is in contrast to e.g. MySQL 5.6
multi-threaded slave, which requires the application to partition its data
into independent schemas.

Here are the preliminary results, in number of seconds for the slave to catch
up (lower is better) versus number of threads (--slave-parallel-threads, 0
means not using parallel replication):

For oltp.lua, 48 threads were used to generate the load on the master, with
--binlog-commit-wait-count=12 --binlog-commit-wait-usec=10000 to allow the
server to delay a commit by up to 10 milliseconds in order to get more group
commit and thus more opportunities for parallel apply on the slave:

A: --log-slave-updates --sync-binlog=1 --innodb-flush-log-at-trx-commit=1
B: --skip-log-slave-updates --innodb-flush-log-at-trx-commit=1
C: --skip-log-bin --innodb-flush-log-at-trx-commit=2
D: --log-bin=master-bin --sync-binlog=0 --innodb-flush-log-at-trx-commit=0

#thr    A       B       C       D
 0    1065     869     193     202
 2     361     432     147     161
 4     221     264     118     121
 8     135     177     103     107
12     114     153     104     105
16     109     140     104     107
24     111     139     107     105
32     111     136      99     109
48     111     126     108     109
64     111     121      99     111

We see here a 2-10 times speedup from parallel replication. The master has
around 12 transactions in every group commit, which provides good
opportunities for parallelism on the slave.
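
As a quick check of the speedup range, here is a minimal Python sketch that
takes the numbers from the table above and computes, for each configuration,
the best catch-up time relative to the non-parallel run. The helper name
best_speedup is my own for illustration; only the timings come from the
benchmark.

```python
# Catch-up times in seconds from the oltp.lua table above.
# Keys are --slave-parallel-threads values; tuple columns are configs A-D.
times = {
    0:  (1065, 869, 193, 202),
    2:  (361, 432, 147, 161),
    4:  (221, 264, 118, 121),
    8:  (135, 177, 103, 107),
    12: (114, 153, 104, 105),
    16: (109, 140, 104, 107),
    24: (111, 139, 107, 105),
    32: (111, 136, 99, 109),
    48: (111, 126, 108, 109),
    64: (111, 121, 99, 111),
}


def best_speedup(times, col):
    """Return (threads, speedup) for the fastest parallel run in one column."""
    baseline = times[0][col]
    threads, fastest = min(
        ((t, row[col]) for t, row in times.items() if t > 0),
        key=lambda pair: pair[1],
    )
    return threads, baseline / fastest


for col, name in enumerate("ABCD"):
    threads, speedup = best_speedup(times, col)
    print(f"{name}: {speedup:.1f}x at {threads} threads")
```

The ratios come out at a little under 2x for C and D and close to 10x for A.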

Note that parallel replication is especially effective when the binlog is
enabled and crash-safe (--sync-binlog=1
--innodb-flush-log-at-trx-commit=1). This is because parallel replication can
run the commit of one transaction in parallel with any other transaction, even
if the two transactions would otherwise conflict. This makes group commit
especially effective. In fact, this manages to more or less completely
eliminate any penalty for enabling crash-safe binlog on the slave, which is
quite nice.
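
To put a number on that penalty, the following Python sketch divides the
crash-safe timings (column A) by the timings with syncing turned off
(column D) from the oltp.lua table. The function name penalty is just an
illustrative choice; a ratio of 1.0 means the crash-safe setup costs nothing
extra.

```python
# Catch-up times in seconds from the oltp.lua table above.
# A: --sync-binlog=1 --innodb-flush-log-at-trx-commit=1 (crash-safe binlog)
# D: --sync-binlog=0 --innodb-flush-log-at-trx-commit=0 (no syncing)
threads = [0, 2, 4, 8, 12, 16, 24, 32, 48, 64]
a_secs = [1065, 361, 221, 135, 114, 109, 111, 111, 111, 111]
d_secs = [202, 161, 121, 107, 105, 107, 105, 109, 109, 111]


def penalty(safe, unsafe):
    """Ratio of crash-safe to non-synced catch-up time (1.0 = no penalty)."""
    return safe / unsafe


ratios = {t: penalty(a, d) for t, a, d in zip(threads, a_secs, d_secs)}
for t, r in ratios.items():
    print(f"{t:>2} threads: A/D = {r:.2f}")
```

Without parallel replication the crash-safe slave takes more than five times
as long as D; from 12 threads upwards the two are within 10% of each other.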

Note also that disabling the binlog actually tends to make things _slower_,
not faster, when using parallel replication. I believe this is due to
MDEV-5802, which may be worth fixing for 10.0.

Here are results for update_index.lua with 48 threads on the master. This
produced around 13 transactions per group commit on the master, with no
--binlog-commit-wait-count to delay commits:

A: --log-slave-updates --sync-binlog=1 --innodb-flush-log-at-trx-commit=1
B: --skip-log-slave-updates --innodb-flush-log-at-trx-commit=1
C: --skip-log-bin --innodb-flush-log-at-trx-commit=2

#thr    A      B      C
 0     931    899    271
 2     546    653    258
 4     365    494    176
 8     261    365    203
12     247    350    197
16     233    336    207
24     242    316    209
32     237    292    194
48     235    270    208
64     228    249    195

Again we get a good speedup from parallel replication, even though with such
small transactions, there is less opportunity for improvement, as the actual
work for transactions is rather small compared to the overhead for managing
the replication of each event. And again, the ability to utilise group commit
effectively provides the biggest benefit.

Finally, I tried a test of update_index.lua where I ran the load on the master
with a single thread. This creates a binlog with _no_ opportunities for
parallelism from group commits - each transaction needs to be executed on its
own by the slave, as we do not know for sure that they will not conflict on
row locks. However, because the commits can still run in parallel (and hence
get group commit on the slave), we still see some speedup even here when
--sync-binlog=1 and --innodb-flush-log-at-trx-commit=1. When binlog and InnoDB
syncing are disabled, parallel replication makes things slower due to the
overhead of thread communication:

A: --log-slave-updates --sync-binlog=1 --innodb-flush-log-at-trx-commit=1
B: --skip-log-slave-updates --innodb-flush-log-at-trx-commit=1
C: --skip-log-bin --innodb-flush-log-at-trx-commit=2

#thr    A      B      C
 0    1075    949    270
 2     597    673    319
 4     443    623    334
 8     407    588    349
12     393    544    336
16     391    536    352
24      -     492    336
32     389    472    358
48     389    419    344
64     391    399    354
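
The split between "faster with syncing" and "slower without" is easy to see
if each timing is expressed relative to the non-parallel run. Below is a
small Python sketch over the table above; the helper name relative_to_serial
is hypothetical, and the missing measurement (A at 24 threads) is kept as
None rather than filled in.

```python
# Catch-up times in seconds from the single-threaded-master table above.
# Tuple columns are configs A, B, C; None marks the run with no result.
rows = {
    0:  (1075, 949, 270),
    2:  (597, 673, 319),
    4:  (443, 623, 334),
    8:  (407, 588, 349),
    12: (393, 544, 336),
    16: (391, 536, 352),
    24: (None, 492, 336),
    32: (389, 472, 358),
    48: (389, 419, 344),
    64: (391, 399, 354),
}


def relative_to_serial(rows, col):
    """Map thread count -> time relative to the 0-thread run (<1 is faster)."""
    base = rows[0][col]
    return {
        t: row[col] / base
        for t, row in rows.items()
        if t > 0 and row[col] is not None
    }


for col, name in enumerate("ABC"):
    rel = relative_to_serial(rows, col)
    print(name, {t: round(r, 2) for t, r in rel.items()})
```

Every value for A and B is below 1.0 (faster than serial apply), while every
value for C is above 1.0 (slower).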

So overall, results look very good, especially for slaves with binlog
enabled. (And binlog disabled could turn out better if MDEV-5802 is fixed.)

Let me know if there are any questions, and I'll be happy to answer them.

 - Kristian.

Kristian Nielsen