I don't understand what "only_commits" is. How can it prepare transactions to "queue up" for group commit if only one is prepared in parallel?
Right, so this is an important optimisation; let me try to explain it, and then I'll answer the other points in a separate (shorter) mail for clarity.

Parallel replication in MariaDB works by having the SQL driver thread schedule each transaction on a pool of worker threads in round-robin fashion:

    worker1  worker2  worker3  worker4
    T1       T2       T3       T4
    T5       T6       T7       T8
    T9       ...

So in this sense, transactions are always applied in parallel threads when @@slave_parallel_threads > 0 and @@slave_parallel_mode != none. But one transaction may need to wait for another at various points to avoid conflicts. The various modes of parallel replication differ in how they do these waits.

Considering a transaction to be applied, there are three points of interest during its replication:

    T: BEGIN;        <P1: transaction start>
       INSERT ...
       UPDATE ...
                     <P2: just before commit>
       COMMIT;
                     <P3: just after commit>

The COMMIT step involves updating the slave GTID position and writing to the binlog (if --log-slave-updates), as well as the InnoDB commit proper.

So consider now transactions T1 and T2, with T1 committed first on the master:

1. If T1 and T2 have different replication domains, and --slave-parallel-domains is ON, then they are applied in parallel without any restrictions; T2 does not wait for T1 at any of the points. This is out-of-order replication: T2 is allowed to commit before T1 on the slave.

2. Otherwise, we have in-order parallel replication: T2 will always commit after T1 on the slave. Thus, during commit, T2 will wait at its point P3 for T1 to reach P3 first. This wait is done in a way such that T1 can do a group commit of both T1 and T2 simultaneously.

3. If we have --slave-parallel-mode=only_commits, then T2 will at point P1 wait for T1 to reach point P2. This means that no query of T2 can run in parallel with any query of T1. However, all of T2 can run in parallel with the COMMIT step of T1.
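Just to make the scheduling concrete, here is a toy sketch (not the actual server code; the function name is made up) of the round-robin assignment described above, where transaction N goes to worker (N mod number-of-workers):

```python
# Toy model of the SQL driver thread's round-robin scheduling:
# transaction i (0-based) is handed to worker 1 + (i mod n_workers).

def round_robin(transactions, n_workers):
    """Assign each transaction to a worker in round-robin order."""
    workers = {w: [] for w in range(1, n_workers + 1)}
    for i, trx in enumerate(transactions):
        workers[1 + i % n_workers].append(trx)
    return workers

schedule = round_robin([f"T{i}" for i in range(1, 10)], 4)
print(schedule)  # worker 1 gets T1, T5, T9; worker 2 gets T2, T6; etc.
```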
In particular, the COMMIT step of T2 can run in parallel with the COMMIT of T1; this is what makes it possible to do group commit in MariaDB parallel replication, even without running the queries themselves in parallel. (If there is an actual row lock conflict between T1 and T2 (maybe they update the same row), then T2 will wait for T1 inside InnoDB, and group commit will not be possible. However, as long as T1 has reached its commit step, it is safe to start T2; T2 will not be able to cause T1 to wait or fail, at most T2 will just have to wait.)

4. Consider now --slave-parallel-mode=follow_master_commit. In this case, if T1 and T2 were group-committed together on the master (same commit_id in the GTID), then at point P1 T2 will not wait for T1; T1 and T2 will be allowed to run in parallel. If we have two different group commits (T1 T2 T3) and (T4 T5) on the master, then at point P1, T4 and T5 will wait for all of T1, T2, and T3 to reach their point P2. Since T4 and T5 were group-committed together, they are conflict-free and safe to run in parallel. But we do not know if they might conflict with T1, T2, or T3, so we make sure those have reached the COMMIT step first.

5. In --slave-parallel-mode=all_transactions, we relax this even further. T2 will _not_ wait at point P1, even if it has a different commit_id from T1. This is the new optimistic parallel replication mode. In this case, we might get a conflict, if first T2 modifies a row and then later T1 needs to modify this row. The result will be a deadlock: T2 gets to point P3 and waits for T1 to commit, but T1 is inside some query waiting for T2 to commit and release its row locks. The deadlock is detected, T2 is killed and rolled back, so that T1 can proceed. We then retry transaction T2. Before retrying T2, we wait for T1 to reach its point P3 first (the idea is that we now _know_ that T1 and T2 conflict, so there seems little point in trying to do them in parallel).
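The follow_master_commit rule in point 4 can be sketched as a simple grouping by commit_id (this is an illustration only, not the server's data structures; the function name is invented). Transactions sharing a commit_id may start their queries in parallel immediately; a transaction in a later group waits at P1 until everything in earlier groups has reached P2:

```python
from itertools import groupby

# Toy model: batch transactions by the commit_id recorded in their GTID
# events. Each batch ran in one group commit on the master, so its
# members are known conflict-free and may be applied in parallel.

def commit_groups(events):
    """events: (gtid, commit_id) pairs in master commit order.
    Returns the batches whose queries may run in parallel."""
    return [[gtid for gtid, _ in grp]
            for _, grp in groupby(events, key=lambda e: e[1])]

batches = commit_groups([("T1", 7), ("T2", 7), ("T3", 7),
                         ("T4", 8), ("T5", 8)])
print(batches)  # [['T1', 'T2', 'T3'], ['T4', 'T5']]
```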
(A conflict can also be detected as a different kind of error, for example a duplicate key violation if T1 deletes a row and T2 inserts the same row. Such errors are handled the same way.)

6. If T2 is not transactional (eg. a MyISAM update), then it is not safe to attempt to run it in parallel with T1. So at point P1 in T2, we wait for T1 to reach point P3. (It would be enough to wait for all prior transactions to reach their point P2, but this would be more complex (=costly) to keep track of, and in optimistic parallel replication we would expect few MyISAM updates, if any.) Note that a following InnoDB transaction T3 could run in parallel with both T1 and T2.

7. If T2 is DDL, then it is not safe to run it in parallel with either T1 or a following T3 (for example, T2 might ALTER a table modified by T1 or T3 from InnoDB to MyISAM). In this case, T2 will at point P1 wait for all prior transactions to reach their point P2, and T3 will likewise at point P1 wait for T2 to reach its point P2.

-----------------------------------------------------------------------

So those are the details of how the different modes actually work. The optimistic parallel replication approach looks really promising, I think. It has the potential to use _all_ the parallelism available in the replication stream, even more than what was possible on the master. There are two main limitations:

(A) Whenever parallel apply is not possible due to a conflict, we run the risk of having to roll back and retry a transaction. So if this happens too frequently, the cost of this might outweigh the benefit from parallel apply.

(B) We commit in-order. If we have 10 worker threads, this means that we cannot start T11 before T1 has completed; all the worker threads will be occupied with the transactions T1, ..., T10. Thus, while some long-running transaction T1 is being replicated, we can at most work N transactions ahead, where N is the value of --slave-parallel-threads.
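The kill-and-retry path from point 5 can be modelled in a few lines. This is an invented sketch of the control flow only (DeadlockError and both function names are made up, not server APIs): apply optimistically, and on a detected conflict, wait for the earlier transaction's point P3 before retrying:

```python
class DeadlockError(Exception):
    """Stand-in for the error raised when the deadlock detector kills T2."""

def apply_optimistically(apply_fn, wait_for_prior_commit):
    """Apply a transaction; on deadlock, serialise the retry behind
    the conflicting earlier transaction."""
    try:
        return apply_fn()
    except DeadlockError:
        # We now *know* the transactions conflict, so there is little
        # point in attempting them in parallel again.
        wait_for_prior_commit()
        return apply_fn()

# Demo: the first attempt deadlocks against T1; after waiting for T1
# to reach its point P3, the retry succeeds.
state = {"t1_committed": False, "calls": 0}

def fake_apply():
    state["calls"] += 1
    if not state["t1_committed"]:
        raise DeadlockError()      # killed and rolled back
    return "T2 committed"

def fake_wait():
    state["t1_committed"] = True   # T1 reaches its point P3

result = apply_optimistically(fake_apply, fake_wait)
print(result)  # T2 committed
```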
What I want to achieve is that optimistic parallel replication will work well out-of-the-box in most cases, _and_ to provide some way for the DBA to tune it to overcome limitations (A) and (B) in special cases where that is needed. This is the motivation for the extra options (which I agree tend to look a bit out of place).

For example, an application may have a hot spot like a single row that is updated by a lot of transactions. Such a row could become even more hot on an optimistic parallel slave, where retries will increase the contention on the row (limitation (A)). So we can try to detect such hot rows on the master (--slave-parallel-wait-if-conflict-on-master). And we can allow the application to explicitly declare such hot-spot updates (@@SESSION.replicate_expect_conflicts). Or imagine a common operation like:

    CREATE TABLE t1 ( ... );
    SET SESSION replicate_expect_conflicts = 1;
    INSERT INTO t1 VALUES (1, ...);
    INSERT INTO t1 VALUES (2, ...);
    INSERT INTO t1 VALUES (3, ...);
    ...

All the inserts are very likely to fail if attempted in parallel with the CREATE TABLE. If this becomes a bottleneck, it seems useful to have a way for the application to work around it.

For limitation (B), I provide the --slave-parallel-domains option. If a long-running operation would block the slave for too long, it can be put into a different replication domain:

    SET SESSION gtid_domain_id=2;
    ALTER TABLE large_table ... ;

Then the operation can run freely in parallel with other operations. But then the application / DBA is responsible for making sure that the operation has completed on all slaves before attempting any following operations that might conflict with it.
I still prefer "auto" as default,
Indeed, I agree, that is where I want to go as well. I guess I'm just being a bit humble here; this is new stuff, and frankly not well tested in practice yet. I feel that I need to start with something that has fine-grained control, to get some real-life experience about what works; that will make it easier for me to understand how an "auto" mode should actually be implemented. (But I do like your earlier suggestion of just creating the "auto" option now to mean something, which can then be refined later. I will try to get that into the first version somehow.)

For me, I'm still at the stage where I want to just understand if the optimistic approach is even semantically correct, before being ready to truly consider the finer points of tuning. But it should work; and even if we manage to find some corner cases where it breaks, it should be possible to handle them by forcing serialisation, similar to the DDL case.

MySQL/MariaDB has the property of fully supporting the 'I' in ACID, unlike some other popular databases; this is what makes statement-based replication possible. Given some arbitrary transactions that run in parallel, assume that they commit in the order T1, T2, T3, ... Then we know that if we were to run those transactions _serially_, we would end up with the exact same state of the database, no matter in what order the operations in T1, T2, T3, ... were originally run.

On the parallel slave, this means that as long as we commit T1, T2, T3, ... in the same order as was done on the master, we will get the same results - if the execution of the transactions is successful. If there is a conflict, we might get a failure or a deadlock, but silently wrong data is not possible. So this is what should make the optimistic approach correct, as long as we handle (with rollback and retry) any failures and deadlocks.

Thanks for reading so far ;-)

 - Kristian.