Salute, Kristian.
andrei.elkin@pp.inet.fi writes:
We can notice that in the pipeline method, any attempt to start waiting for a lock granted to another branch needs interception and erroring out. We are bound to this because the granted lock is released only after 2PC.
So this is essentially what thd_rpl_deadlock_check() is used for. It allows the parallel replication code to see all lock waits in storage engines. Suppose T2 occurs after T1 in the binlog. If T2 waits for T1 that is fine, T2 would in any case need to wait for T1 to commit first. But if T1 waits for T2, that is not allowed, T1 must commit first. So we kill T2, allowing T1 to continue. T2 is then re-tried once T1 committed.
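To make the ordering rule concrete, here is a minimal, self-contained sketch of the decision (my own illustration in plain C++, not the actual server code; ReplTrx and should_kill_blocker are made-up names):

  #include <cstdint>

  // Simplified model of a replicated transaction: the only property that
  // matters for the rule is its position in the binlog (master commit order).
  struct ReplTrx {
    uint64_t binlog_order;   // lower value = earlier in the binlog
  };

  // The engine reports "waiter is blocked on a lock held by holder".
  // If the waiter must commit first (it is earlier in the binlog), the holder
  // has to be killed and re-tried later; otherwise the wait is harmless,
  // because the waiter would have to wait for the holder's commit anyway.
  bool should_kill_blocker(const ReplTrx &waiter, const ReplTrx &holder)
  {
    return waiter.binlog_order < holder.binlog_order;
  }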
So I think there might be a quick'n'dirty way to get a proof-of-concept of the idea of replicating in parallel different statements in a single transaction. In rpl_parallel::do_event(), just duplicate all (transactional) GTID and commit/XID events. Then optimistic parallel replication will just run all the statements individually in parallel, and in case of any dependencies roll back and retry.
Though retry_event_group will need to be hacked as well so it knows to re-try only its own single statement in an event group. And maybe something else will break... but if it works, it could be a very quick way to do some benchmarking on various workloads to get an idea of the potential benefit, before starting to tackle all the difficult consistency requirements etc.
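As a rough illustration of that hack (purely a toy model, not the real rpl_parallel::do_event() code; the Event type and split_group are invented for this mail), the idea is to turn one event group of n statements into n single-statement groups by duplicating the GTID and commit/XID delimiters:

  #include <string>
  #include <vector>

  // Toy event model; the real code would operate on Log_event objects.
  struct Event {
    enum Kind { GTID, STATEMENT, XID } kind;
    std::string payload;
  };

  // Turn one group  GTID, S1, ..., Sn, XID  into n groups  GTID, Si, XID
  // so that the optimistic scheduler treats every statement as its own
  // "transaction" and applies them in parallel, retrying on conflicts.
  std::vector<std::vector<Event>> split_group(const std::vector<Event> &group)
  {
    std::vector<std::vector<Event>> out;
    for (const Event &ev : group)
    {
      if (ev.kind != Event::STATEMENT)
        continue;                              // drop the original delimiters
      out.push_back({ {Event::GTID, ""},       // duplicated GTID event
                      ev,                      // the single statement
                      {Event::XID, ""} });     // duplicated commit/XID event
    }
    return out;
  }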
Just to add, this type of deadlock does not necessarily involve ordered committing as a resource. It can also arise from various engine "features" (e.g. the "asymmetric" lock conflict observable in InnoDB). Notice that this sort of conflict is *never* about modifying the same data, merely about accessing its location.

When it comes to the pipeline method, let me show that it is actually deadlock-free when elaborated the following way. We respect intra-sub-transaction dependencies: all of a sub-transaction's statements on the same table are scheduled to the same branch. That is, if the sub-transaction T(t1,t2) dealt with two tables, there could be at most 2 slave branches B1(t1), B2(t2). And if there are two sub-transactions T1(t1,t2), T2(t1,t2), there could be up to four: B1(t1), B2(t2), B3(t1), B4(t2). In theory B1,B3 and B2,B4 are pair-wise non-conflicting, but we must suspect that in practice they do conflict. We rate as a conflict any lock request (e.g. by B3) that would not be granted because it is already granted to another branch (that would be B1).

On the *concept* level, when that happens B3's statement indeed returns with an error, its work - the statement's (!) - is rolled back, and the statement is passed to B1. Let's *refine* it right away, since replaying the whole statement on B1 is not proven to succeed when it consists of multiple record operations. Instead of B3 erroring out, only its problematic sub-statement is split out and passed on; B3 goes on to the following sub-statement or statements. B1 will find the sub-statement queued to it for execution, and it *must*^{#f1} succeed. It is implied (though not so obvious) that it is safe to extract a sub-statement from the Rows-log-event statement. The sub-statement is passed as a dynamically created single-record Rows-log-event.

To sum up, there would hardly exist any deadlock and retry. And despite the inability to "scatter" the original transaction's statements, fair load distribution remains promising, even in apparently unfavorable cases.

The described lock-wait conflict resolution hints at viable sub-statement-level parallelization. That is, when a Rows-log-event contains a big number of costly sub-statements, and the sub-statements are independent, the event could be executed in parallel by splitting its statement into sub-statement ranges and delegating them to different branches (workers). Example: LOAD DATA's Write_rows_log_events.
When the statement errors out, it will be reassigned to the lock-grantee worker (see the sketch after the footnote below). If along this way the group still cannot be finished, we could roll back all branches and fall back to the conservative scheduler, and eventually to the sequential one.
Footnote #f1: [this remains as a last resort]
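To illustrate just the reassignment step, here is a minimal, self-contained C++ sketch under the assumptions above (RowOp, Branch and on_conflict are made-up names for this mail, not existing server code): a branch that hits a lock conflict splits off the single offending sub-statement, queues it to the lock-grantee branch, and continues with its remaining work.

  #include <deque>
  #include <vector>

  // One sub-statement of a Rows-log-event, reduced to what matters here.
  struct RowOp { int table_id; int row_key; };

  // A slave branch (worker) in the pipeline scheme.
  struct Branch {
    std::deque<RowOp> queue;        // own sub-statements still to execute
    std::vector<RowOp> reassigned;  // single-record events handed over by other branches
  };

  // Called when the engine reports that the lock for b3's current sub-statement
  // is already granted to b1. Only the conflicting sub-statement migrates; it is
  // re-emitted for b1 as a dynamically created single-record Rows-log-event,
  // while b3 proceeds with the rest of its statement.
  void on_conflict(Branch &b3, Branch &b1 /* lock grantee */)
  {
    RowOp conflicting = b3.queue.front();
    b3.queue.pop_front();
    b1.reassigned.push_back(conflicting);
  }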
So this is the idea that the scheduler will first analyse event groups for dependencies, and then schedule them in a conflict-free way to worker threads. The main problem to attack then becomes one of discovering dependencies in a scalable way.
For a long time, this is also how I thought of the problem. It's not just conservative mode, there were many ideas for doing this, some going back more than 10 years. There were also patches for this, for example a very interesting patch (not by me, should be in mailing list archive) that analyzed row-events and scheduled conflicting transactions to the same branch.
I vaguely remember this type of project for mysqlbinlog.
But eventually I've reached the conclusion that optimistic mode is just the better approach. The main reason is that it provides very good opportunities for parallel apply (and hence speedup), while _also_ providing very strong consistency guarantees for existing (and new) applications. In more detail:
1. Optimistic replication has the best potential to discover opportunities for parallel execution - it can run _anything_ in parallel that does not conflict.
It's universal, admittedly.
It can even run many transactions in parallel that _do_ conflict! Even if transactions T1 and T2 have a conflict, there is still a chance that T1 will grab the conflicting lock first, and the first part of T2 can successfully replicate in parallel with T1.
2. The in-order commit provides very strong and robust consistency guarantees, both on data integrity (eg. in case of crash), and on read-consistency (a query on slave will never see a state of data that did not also exist on master). It is important not to under-estimate how important this point can be for users to have confidence in using the parallel replication.
3. When dependencies do prevent parallel execution of T1 and T2 (and we have to kill and re-try T2), optimistic parallel replication still acts as a pre-fetcher to speed up I/O-bound workloads. I/O latencies are so large as to often dwarf any CPU bottleneck, and disk systems often require multiple outstanding I/O requests to scale well.
4. Optimistic parallel replication re-uses existing mechanisms - it uses the normal query execution path in storage engines to discover conflicts, and normal thread kill/rollback is used to handle conflicts. It avoids all the complexity of trying to duplicate the exact conflict semantics in the dependency-checking algorithms. It avoids the need to be overly conservative in the dependency-checking to ensure safety (eg. it trivially is able to parallelise non-conflicting statement-based transactions on the same single table).
And in the dependency-checking approach, if you let just one corner case through that you need to handle during execution - then you end up implementing the full optimistic conflict handling _anyway_ (as I learned the hard way :-).
Specifically to the outlined paragraph, you must mean the necessity of rollback-replay in case something "unexpectedly" slips out of control? [To that part, I left the conservative method as a backup.]
There are a few limitations to optimistic parallel replication that I first thought would be quite severe. But experience seems to show they are not; here are my thoughts on what makes this so:
1. Optimistic parallel replication increases the rollback rate, sometimes dramatically so. I think the point is that this rollback also happens in parallel. Even a small increase (eg. 10%) in replication throughput can make the difference between a slave that keeps up and one that lags behind, and this is well worth the cost of increasing CPU usage with a core or two; modern servers usually have plenty of cores to spare. And rollback+retry typically does not incur much extra I/O overhead (eg. pre-fetching).
2. In-order commit means that look-ahead past a long-running transaction is limited by how many pending commits we can have outstanding. But as Jean-François demonstrated in his benchmarks, optimistic parallel replication can have thousands of outstanding commits and still scale. If the slave is keeping up (and hopefully it is!), there probably are not thousands of pending transactions to work on in parallel anyway. And even approaches that do allow out-of-order commit probably need some limit on look-ahead anyway
(eg. I think MySQL uses a "checkpoint" mechanism.)
To cope with gaps in the parallelization range ('pluralization' in one of the mails, thanks to my ispell's "creative" selection :-)) you mean?
3. In-order parallel replication is limited to inter-transaction parallelism, it cannot do intra-transaction parallel apply.
It could still be approximated, though, by collapsing sub-transactions to create roughly equal-size branches.
Thus if the workload mainly consists of large transactions with conflicts between them, speedup is limited to prefetching. Well, this is exactly the limitation that your idea is addressing! Which is one reason I found it so interesting to learn about.
I pointed out earlier that the pipeline is orthogonal to the optimistic mode; it clashes with the conservative mode's in-order commit. The optimistic mode of course fares best with sub-transactions of equal sizes, which pipelining aims to provide. So we should expect maximum performance from combining the two.
I really think parallel replication should *just*work* out-of-the-box for the "normal" user. This is 2018, AMD server CPUs come with 32 cores, and the other day I saw a raspberry-pi-lookalike board with 6 cores. A decent, robust implementation of parallel replication needs to be enabled by default. The user shouldn't need to review all of their applications to check if they can tolerate read consistency different from that on the master, let alone re-write them to spread queries over multiple tables or databases.
The optimistic parallel replication is the first method I have seen that can properly satisfy this - give decent speedup on most real-life workloads, in a way that is transparent to applications. Users can just enable optimistic parallel replication, and if it fails in some way, it is a bug that should be fixed.
(How about enabling it by default in 10.4, BTW?)
Personally I have no objection. It's quite stable already.
Of course, 10.3 MariaDB parallel replication does not solve all problems, far from it!
I agree that better dependency analysis in the scheduler can be usefully combined with optimistic parallel replication, not just the conservative mode.
For example, DDL currently stalls optimistic parallel replication completely, and the user must manually set GTID domain id and control dependencies to replicate DDL in parallel with other transactions. Optimistic also does not do much for non-transactional DML. Improved dependency tracking and scheduling could improve this.
Analysis could also determine events that are very likely to conflict, and this way reduce unnecessary rollback. There is the "optimistic" vs. "aggressive" --slave-parallel-mode distinction to enable/disable such heuristics, but few are currently implemented.
Right, dependency tracking on the master is a follow-up in that direction.
While the feature, as you put it in the comments, is largely about the optimistic schedulers, it still applies to some corner cases in the conservative one. You mention 'different optimizer decision', which hints at SBR specifics.
Yes, the thd_rpl_deadlock_check() feature was originally implemented to fix various corner cases in conservative parallel replication. I think this mostly/only occurs when commit order on the slave is forced to be the same as on master.
The "different optimiser decision" is the theoretical case where the optimiser chooses to use an index on the master but a table scan on the slave. Or chooses different indexes on master and slave (could also apply to RBR perhaps). In this case different locks can be taken on master and slave and it seems likely that order-conflicts could result.
[ack]
Here is how it would go the simple way, which must make sense because statistically S2->S3 can't be a common pattern.
The concrete example was silly, of course, updating the same row twice in a single transaction. But for example foreign key conflicts will be common. Eg.:
  BEGIN;
  INSERT INTO parent_table(key,val) VALUES (10, "hello");
  INSERT INTO child_table(key, parent_key, val) VALUES (1, 10, "xizzy");
  INSERT INTO child_table(key, parent_key, val) VALUES (2, 10, "red");
  INSERT INTO child_table(key, parent_key, val) VALUES (3, 10, "cube");
  COMMIT;
where parent_key is a foreign key referencing "key" in parent_table.
The FK dependency type is of some challenge as it applies to RBR (the pipeline's main client). Even though a few ideas exist around it, initially it can be handled by marking the group on the master to prevent pipelining on the slave.
On the other hand, the pulling is voluntary, and assuming the queue always has something, the worker stays busy non-stop at least until prepare (perhaps we could involve it in that as well). An optimistic version of this method would be to leave the prepared, modified branch in the hands of the leader and switch to the next group's events.
Right.
In fact, this is a limitation of the existing scheduling algorithm that already has an effect which is seen for multi-source replication and for replication with multiple GTID domain ids. In this case it is possible for the SQL driver thread to completely fill up the worker queues with events from one source / one domain. And the other sources/domains will stall until the first one catches up. Which is not great. And thus we have the work-around with slave_domain_parallel_threads which is rather crude and inflexible.
A more flexible scheduler would also be needed to optimise the thread usage related to wait_for_prior_commit().
Indeed. This also remains in the optimistic pipelining.
Currently this method blocks the worker thread, and thus for long lookahead in optimistic parallel replication, a large number of worker threads is needed, most of them idle as you also pointed out. It might be possible to leave the THD in the group commit queue and let the thread switch to work on another transaction meanwhile. This is something I never really looked into, but I think there are similar mechanisms already existing for thread pool operation. Not sure how much can be gained nowadays from avoiding a large number of idle threads, but this again would require a better scheduling algorithm.
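Just to make that idea a bit more tangible, a minimal sketch of such a "park the prepared transaction and move on" scheme could look as below. Everything here (PreparedTrx, CommitQueue, park/drain) is invented for illustration and glosses over the real THD/group-commit machinery; it also assumes, for simplicity, that workers reach prepare in binlog order.

  #include <cstdint>
  #include <deque>
  #include <mutex>

  // Handle to a prepared transaction, detached from the worker thread.
  struct PreparedTrx {
    uint64_t binlog_order;   // position in the master commit order
    // ... engine handles needed to finish the commit later
  };

  struct CommitQueue {
    std::mutex lock;
    std::deque<PreparedTrx> parked;

    // Worker side: after prepare, park the transaction and return to the
    // scheduler for the next event group instead of blocking in
    // wait_for_prior_commit().
    void park(PreparedTrx trx)
    {
      std::lock_guard<std::mutex> g(lock);
      parked.push_back(trx);
    }

    // Leader side: commit parked transactions as their turn in binlog
    // order comes up.
    void drain(uint64_t &next_to_commit)
    {
      std::lock_guard<std::mutex> g(lock);
      while (!parked.empty() && parked.front().binlog_order == next_to_commit)
      {
        /* perform the actual commit of parked.front() here */
        parked.pop_front();
        ++next_to_commit;
      }
    }
  };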
Or is there some mechanism by which two different threads could run S2 and S3 in parallel, but still within a single transaction (inside InnoDB eg.?) - I imagine this is not the case.
We are certainly going in that direction. Let me shy away from detailing it in an already day-long mail :-); anyway, it's better to show a patch that templates an idea.
Right, running a single transaction (or even single statement) on multiple threads is also something interesting to look at - also outside of replication of course, but yes, probably best left for another discussion ;-)
Now I think I'm fully equipped to write it all down in MDEV-16404, so we'll be discussing specific items and stages. Thanks so far! Andrei
- Kristian.