I don't understand what "only_commits" is. How can it prepare transactions to "queue up" for group commit if only one is prepared in parallel?
Right, so this is an important optimisation; let me try to explain it, and then I'll answer the other points in a separate (shorter) mail for clarity.

Parallel replication in MariaDB works by having the SQL driver thread schedule each transaction on a pool of worker threads in round-robin fashion:

    worker1  worker2  worker3  worker4
    T1       T2       T3       T4
    T5       T6       T7       T8
    T9       ...

So in this sense, transactions are always applied in parallel threads when @@slave_parallel_threads > 0 and @@slave_parallel_mode != none. But one transaction may need to wait for another at various points to avoid conflicts. The various modes of parallel replication differ in how they do these waits.

Considering a transaction to be applied, there are three points of interest during its replication:

    T: BEGIN;        <P1: transaction start>
       INSERT ...
       UPDATE ...
                     <P2: just before commit>
       COMMIT;
                     <P3: just after commit>

The COMMIT step involves updating the slave GTID position and writing to the binlog (if --log-slave-updates), as well as the InnoDB commit proper.

So consider now transactions T1 and T2, with T1 committed first on the master:

1. If T1 and T2 have different replication domains, and --slave-parallel-domains is ON, then they are applied in parallel without any restrictions; T2 does not wait for T1 at any of the points. This is out-of-order replication: T2 is allowed to commit before T1 on the slave.

2. Otherwise, we have in-order parallel replication: T2 will always commit after T1 on the slave. Thus, during commit, T2 will wait at its point P3 for T1 to reach P3 first. This wait is done in a way such that T1 can do a group commit of both T1 and T2 simultaneously.

3. If we have --slave-parallel-mode=only_commits, then T2 will at point P1 wait for T1 to reach point P2. This means that no query of T2 can run in parallel with any query of T1. However, all of T2 can run in parallel with the COMMIT step of T1.
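Just to make the scheduling concrete, here is a toy sketch (not the actual server code; the function name is made up) of the round-robin assignment described above, where transaction N goes to worker (N mod number-of-workers):

```python
# Toy model of the SQL driver thread's round-robin scheduling:
# transaction i (0-based) is handed to worker 1 + (i mod n_workers).

def round_robin(transactions, n_workers):
    """Assign each transaction to a worker in round-robin order."""
    workers = {w: [] for w in range(1, n_workers + 1)}
    for i, trx in enumerate(transactions):
        workers[1 + i % n_workers].append(trx)
    return workers

schedule = round_robin([f"T{i}" for i in range(1, 10)], 4)
print(schedule)  # worker 1 gets T1, T5, T9; worker 2 gets T2, T6; etc.
```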
In particular, the COMMIT step of T2 can run in parallel with the COMMIT of T1; this is what makes it possible to do group commit in MariaDB parallel replication, even without running the queries themselves in parallel. (If there is an actual row lock conflict between T1 and T2 (maybe they update the same row), then T2 will wait for T1 inside InnoDB, and group commit will not be possible. However, as long as T1 has reached its commit step, it is safe to start T2; T2 will not be able to cause T1 to wait or fail, at most T2 will just have to wait.)

4. Consider now --slave-parallel-mode=follow_master_commit. In this case, if T1 and T2 were group-committed together on the master (same commit_id in the GTID), then at point P1 T2 will not wait for T1; T1 and T2 will be allowed to run in parallel. If we have two different group commits (T1 T2 T3) and (T4 T5) on the master, then at point P1, T4 and T5 will wait for all of T1, T2, and T3 to reach their point P2. Since T4 and T5 were group-committed together, they are conflict-free and safe to run in parallel. But we do not know if they might conflict with T1, T2, or T3, so we make sure those have reached the COMMIT step first.

5. In --slave-parallel-mode=all_transactions, we relax this even further. T2 will _not_ wait at point P1, even if it has a different commit_id from T1. This is the new optimistic parallel replication mode. In this case, we might get a conflict, if first T2 modifies a row and then later T1 needs to modify this row. The result will be a deadlock: T2 gets to point P3 and waits for T1 to commit, but T1 is inside some query waiting for T2 to commit and release its row locks. The deadlock is detected, T2 is killed and rolled back, so that T1 can proceed. We then retry transaction T2. Before retrying T2, we wait for T1 to reach its point P3 first (the idea is that we now _know_ that T1 and T2 conflict, so there seems little point in trying to do them in parallel).
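The follow_master_commit rule in point 4 can be sketched as a simple grouping by commit_id (this is an illustration only, not the server's data structures; the function name is invented). Transactions sharing a commit_id may start their queries in parallel immediately; a transaction in a later group waits at P1 until everything in earlier groups has reached P2:

```python
from itertools import groupby

# Toy model: batch transactions by the commit_id recorded in their GTID
# events. Each batch ran in one group commit on the master, so its
# members are known conflict-free and may be applied in parallel.

def commit_groups(events):
    """events: (gtid, commit_id) pairs in master commit order.
    Returns the batches whose queries may run in parallel."""
    return [[gtid for gtid, _ in grp]
            for _, grp in groupby(events, key=lambda e: e[1])]

batches = commit_groups([("T1", 7), ("T2", 7), ("T3", 7),
                         ("T4", 8), ("T5", 8)])
print(batches)  # [['T1', 'T2', 'T3'], ['T4', 'T5']]
```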
(A conflict can also be detected as a different kind of error, for example a duplicate key violation if T1 deletes a row and T2 inserts the same row. Such errors are handled the same way.)

6. If T2 is not transactional (eg. a MyISAM update), then it is not safe to attempt to run it in parallel with T1. So at point P1 in T2, we wait for T1 to reach point P3. (It would be enough to wait for all prior transactions to reach their point P2, but this would be more complex (=costly) to keep track of, and in optimistic parallel replication we would expect few MyISAM updates, if any.) Note that a following InnoDB transaction T3 could run in parallel with both T1 and T2.

7. If T2 is DDL, then it is not safe to run it in parallel with either T1 or a following T3 (for example, T2 might ALTER a table modified by T1 or T3 from InnoDB to MyISAM). In this case, T2 will at point P1 wait for all prior transactions to reach their point P2, and T3 will likewise at point P1 wait for T2 to reach its point P2.

-----------------------------------------------------------------------

So those are the details of how the different modes actually work. The optimistic parallel replication approach looks really promising, I think. It has the potential to use _all_ the parallelism available in the replication stream, even more than what was possible on the master. There are two main limitations:

(A) Whenever parallel apply is not possible due to a conflict, we run the risk of having to roll back and retry a transaction. So if this happens too frequently, the cost of this might outweigh the benefit from parallel apply.

(B) We commit in-order. If we have 10 worker threads, this means that we cannot start T11 before T1 has completed; all the worker threads will be occupied with the transactions T1, ..., T10. Thus, while some long-running transaction T1 is being replicated, we can at most work N transactions ahead, where N is the value of --slave-parallel-threads.
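The kill-and-retry path from point 5 can be modelled in a few lines. This is an invented sketch of the control flow only (DeadlockError and both function names are made up, not server APIs): apply optimistically, and on a detected conflict, wait for the earlier transaction's point P3 before retrying:

```python
class DeadlockError(Exception):
    """Stand-in for the error raised when the deadlock detector kills T2."""

def apply_optimistically(apply_fn, wait_for_prior_commit):
    """Apply a transaction; on deadlock, serialise the retry behind
    the conflicting earlier transaction."""
    try:
        return apply_fn()
    except DeadlockError:
        # We now *know* the transactions conflict, so there is little
        # point in attempting them in parallel again.
        wait_for_prior_commit()
        return apply_fn()

# Demo: the first attempt deadlocks against T1; after waiting for T1
# to reach its point P3, the retry succeeds.
state = {"t1_committed": False, "calls": 0}

def fake_apply():
    state["calls"] += 1
    if not state["t1_committed"]:
        raise DeadlockError()      # killed and rolled back
    return "T2 committed"

def fake_wait():
    state["t1_committed"] = True   # T1 reaches its point P3

result = apply_optimistically(fake_apply, fake_wait)
print(result)  # T2 committed
```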
What I want to achieve is that optimistic parallel replication will work well out-of-the-box in most cases, _and_ to provide some way for the DBA to tune it to overcome limitations (A) and (B) in special cases where that is needed. This is the motivation for the extra options (which I agree tend to look a bit out of place).

For example, an application may have a hot spot like a single row that is updated by a lot of transactions. Such a row could become even more hot on an optimistic parallel slave, where retries will increase the contention on the row (limitation (A)). So we can try to detect such hot rows on the master (--slave-parallel-wait-if-conflict-on-master). And we can allow the application to explicitly declare such hot-spot updates (@@SESSION.replicate_expect_conflicts). Or imagine a common operation like:

    CREATE TABLE t1 ( ... );
    SET SESSION replicate_expect_conflicts = 1;
    INSERT INTO t1 VALUES (1, ...);
    INSERT INTO t1 VALUES (2, ...);
    INSERT INTO t1 VALUES (3, ...);
    ...

All the inserts are very likely to fail if attempted in parallel with the CREATE TABLE. If this becomes a bottleneck, it seems useful to have a way for the application to work around it.

For limitation (B), I provide the --slave-parallel-domains option. If a long-running operation would block the slave for too long, it can be put into a different replication domain:

    SET SESSION gtid_domain_id=2;
    ALTER TABLE large_table ... ;

Then the operation can run freely in parallel with other operations. But then the application / DBA is responsible for making sure that the operation has completed on all slaves before attempting any following operations that might conflict with it.
I still prefer "auto" as default,
Indeed, I agree, that is where I want to go as well. I guess I'm just being a bit humble here; this is new stuff, and frankly not well tested in practice yet. I feel that I need to start with something that has fine-grained control, to get some real-life experience about what works; that will make it easier for me to understand how an "auto" mode should actually be implemented. (But I do like your earlier suggestion of just creating the "auto" option now to mean something, which can then be refined later. I will try to get that into the first version somehow.)

For me, I'm still at the stage where I want to just understand if the optimistic approach is even semantically correct, before being ready to truly consider the finer points of tuning. But it should work; and even if we manage to find some corner cases where it breaks, it should be possible to handle them by forcing serialisation, similar to the DDL case.

MySQL/MariaDB has the property of fully supporting the 'I' in ACID, unlike some other popular databases; this is what makes statement-based replication possible. Given some arbitrary transactions that run in parallel, assume that they commit in the order T1, T2, T3, ... Then we know that if we were to run those transactions _serially_, we would end up with the exact same state of the database, no matter in what order the operations in T1, T2, T3, ... were originally run.

On the parallel slave, this means that as long as we commit T1, T2, T3, ... in the same order as was done on the master, we will get the same results - if the execution of the transactions is successful. If there is a conflict, we might get a failure or a deadlock, but silently wrong data is not possible. So this is what should make the optimistic approach correct, as long as we handle (with rollback and retry) any failures and deadlocks.

Thanks for reading so far ;-)

 - Kristian.