andrei.elkin@pp.inet.fi writes:
I have yet to take a closer look at thd_rpl_deadlock_check(), perhaps that's why
Oh. I definitely think you should get well acquainted with the existing parallel replication in MariaDB to undertake a project as ambitious as this one. There are a number of central concepts and algorithms that work together in synergy, including (off the top of my head):

 - binlog group commit, which uses the commit_ordered mechanism to efficiently perform commits for multiple threads and coordinate with storage engines.

 - wait_for_prior_commit, which works with the binlog group commit to scalably handle dependencies between replication workers.

 - thd_rpl_deadlock_check(), which works with supporting storage engines to detect conflicts between transactions replicated in parallel.

 - rpl_parallel_entry and group_commit_orderer, which efficiently coordinate scheduling between worker threads (Jean François' benchmarks show this scaling to 5000 threads).

 - And behind it all, the strict ordering of commits, which underlies many assumptions around parallel replication as well as GTID, and allows things like optimistic/aggressive parallel replication in a way that is completely transparent to applications.

One thing about the pipeline-style slave parallel applier is still not clear to me. How will conflicts between individual statements be handled? E.g. suppose we have this group commit (epoch) on the master:

    BEGIN;                            /* T1 */
    INSERT INTO t1 VALUES (...);      /* S1 */
    COMMIT;

    BEGIN;                            /* T2 */
    UPDATE t1 SET b=10 WHERE a=1;     /* S2 */
    UPDATE t1 SET b=20 WHERE a=1;     /* S3 */
    COMMIT;

So the idea is to schedule these two transactions (T1, T2) to three different worker threads, right? Each thread gets one of (S1, S2, S3). But we must not commit S3 before S2, obviously, or we end up with an incorrect result. So how will this dependency be handled?

I was assuming that this would use the same mechanism as optimistic parallel replication - in fact, I thought the point was that this is precisely a clever generalisation of optimistic parallel replication, realising that transaction boundaries need not be respected. And since the epoch does not help with the dependency between S2 and S3, I was confused about the point of using it to detect the lack of dependency between T1 and T2...

But if you say that you had not considered using the thd_rpl_deadlock_check mechanism, then I am confused as to how to detect that S3 needs to wait for S2?
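For concreteness, here is a minimal sketch of the in-order-commit half of that mechanism - plain std::thread code, purely illustrative, not MariaDB's actual wait_for_prior_commit, and it deliberately ignores the lock-wait/kill-and-retry handling that thd_rpl_deadlock_check provides. Workers apply S1, S2, S3 concurrently, but are only released to commit in master order:

    // Illustrative only: a commit gate that releases workers strictly in
    // master order, so S3 can never commit before S2.
    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <thread>

    struct CommitOrder {
      std::mutex m;
      std::condition_variable cv;
      int next_to_commit = 0;     // master-order sequence allowed to commit

      void commit_in_order(int my_seq, const char *name) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return next_to_commit == my_seq; });
        std::printf("commit %s\n", name);  // the real commit would go here
        ++next_to_commit;
        cv.notify_all();
      }
    };

    int main() {
      CommitOrder order;
      // Three workers apply the statements concurrently...
      std::thread w1([&] { order.commit_in_order(0, "S1"); });
      std::thread w3([&] { order.commit_in_order(2, "S3"); });
      std::thread w2([&] { order.commit_in_order(1, "S2"); });
      // ...but the output is always: commit S1, commit S2, commit S3.
      w1.join(); w2.join(); w3.join();
    }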
Assume out-of-order commit of the grouped original transactions. GTID gaps would exist for some short time, but as long as there are no crashes, conservative slave execution (that is, respecting the group boundaries) would fill them; and in case of crash recovery the slave could specify the gaps at reconnect (unless those transactions are already in the relay log) to have them retransmitted. In the log-bin case they would be logged out of order, and may appear as a set (mathematically). However, it is a range as the union of GTIDs, and it - the master range - can be preserved as described along the replication chain.
I don't understand how MariaDB GTID can work unless the slave binlogs the GTIDs in the exact same order as the master. At any time, a lower-level slave may connect, and you have to find the right point at which to start sending it events. There is only a single GTID to start from. Remember, it is not guaranteed that the GTIDs are in strict sequence-number ordering from the master! Things like a master-master ring or transactions executed directly on the slave can break ordering when gtid_strict_mode is not set.

Suppose you have GTIDs 0-1-100,0-1-99 in the log and a lower-level slave connects at GTID 0-1-100. How will you know whether you need to send 0-1-99 to the slave (because of an out-of-order master) or whether you need to skip it (because we did parallel execution out of order)?

Or suppose binlogging is changed so that it uses the slave relay logs instead (identical to the master binlogs), but replicated transactions are committed to storage engines out of order. Then how do we remember which parts are committed in the storage engines, so we can recover from a crash?

In current MariaDB parallel replication, the enforced commit ordering ensures that a single GTID position is enough to recover the slave from a crash, as well as to connect a lower-level slave. It also guarantees to users that they can enable parallel replication without introducing any new read inconsistencies.
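To restate that ambiguity as code - hypothetical and illustrative only, not the actual binlog dump thread logic - a scan that starts sending after a single GTID position cannot tell the two cases apart:

    // Binlog contains 0-1-100 followed by 0-1-99 (out of sequence order);
    // a lower-level slave connects at 0-1-100.
    #include <cstdint>
    #include <vector>

    struct Gtid { uint32_t domain, server; uint64_t seq; };

    // Sends every event found after the start position.  This would send
    // 0-1-99, which is right if the master itself logged out of order, but
    // wrong if the slave merely committed/binlogged it out of order and the
    // connecting slave already has it.  A single start GTID cannot
    // distinguish the two cases.
    std::vector<Gtid> events_to_send(const std::vector<Gtid> &binlog, Gtid start) {
      std::vector<Gtid> out;
      bool found = false;
      for (const Gtid &g : binlog) {
        if (found)
          out.push_back(g);
        if (g.domain == start.domain && g.server == start.server &&
            g.seq == start.seq)
          found = true;
      }
      return out;
    }

    int main() {
      std::vector<Gtid> binlog = {{0, 1, 100}, {0, 1, 99}};
      return (int)events_to_send(binlog, {0, 1, 100}).size();  // 1: send or skip?
    }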
Right, those 40% of rollbacks we are going to trade for extra productivity :-).
But the point is - those 40% of rollbacks were beneficial. The main cost of the transaction was the I/O to read the affected pages from disk - so running future transactions in parallel is a win even if we know that they need to be rolled back and retried later. I/O latency is perhaps the most important slave bottleneck to address with parallel replication, and speculative apply can work as a pre-fetcher.
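A toy model of that pre-fetching effect - hypothetical, nothing like the real InnoDB buffer pool code - just to show why the rolled-back first attempt is not wasted:

    #include <cstdio>
    #include <set>

    static std::set<int> buffer_pool;          // page ids currently in memory

    static bool read_page(int page_id) {       // returns true on a cache hit
      bool hit = buffer_pool.count(page_id) != 0;
      buffer_pool.insert(page_id);             // a miss faults the page in from disk
      return hit;
    }

    int main() {
      // Speculative first attempt: all misses, i.e. it pays the disk I/O.
      read_page(1); read_page(2); read_page(3);
      // Conflict detected -> rollback.  The pages stay in the buffer pool,
      // so the retry in commit order is cache-hot and cheap.
      std::printf("retry: %d %d %d (1 = cache hit)\n",
                  read_page(1), read_page(2), read_page(3));
      return 0;
    }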
Very true. On one hand we can go in blind, "speculatively", ready to pay the rollback price directly; on the other hand the master already detects (some) conflicts, and the actual conflict-free range of transactions is wider than a mere single binlog group (mind the reference to the logical clock, which can "measure" dependencies).
Can you btw. point me to a detailed description of how the logical clock works? My impression is that it still has limited scope for parallelisation if the master has fast commits, but not knowing the details I may be missing something...
The assignment role of the producer is clearly better off taken over by the workers, though. What I meant in the original mail is that the assignment mechanism can be improved to avoid any chance of a worker idling, but more importantly to facilitate fairness. The workers go and pick up pieces of work as soon as they become available. The optimistic scheduler's assignment is close to that, but it still assigns only round-robin, that is, speculatively.
I don't understand "the assignment role of the producer is clearly better off be taken by the workers". How would this reduce workers idling? I do not think the current parallel replication scheduler leaves workers idle? On the other hand, if workers are to individually pull work from a shared queue, then they will all be contending for access to the queue (imagine 5000 threads doing this at once...). The idea with the existing scheduler is that this does not happen - each worker has its own queue, so contention is reduced. (Just to be clear, I'm trying to understand here, the current MariaDB scheduler is obviously not perfect).
In fact, if the existing mechanism can be slightly extended, maybe it can already do much of what is needed to ensure consistency. Eg. suppose we have 3 original transactions each with two statements:
e^1_1 e^1_2 e^2_1 e^2_2 e^3_1 e^3_2
Suppose the first 3 of these are ready to commit at some point:
e^1_1 e^1_2 e^2_1
The leader can check that it doesn't try to commit only part of an original transaction.
Here I will probably correct your initial perception (otherwise I am lost as to how e^2_1 got separated). In this case you mean that the three (original) transactions T (together the set E) are mapped into a modified
T'_1 = {e^1_1 e^1_2 e^2_1}
and some more, e.g. just one more T'_2, whose union gives
T'_2 \cup T'_1 = T_1 \cup T_2 \cup T_3 = E
as the leader can't initiate the two-phase commit without T'_2.
To illustrate what I mean, take my earlier example, with two transactions T1 and T2 containing the individual statements S1, S2, and S3. My understanding is that on the slave, we will start 3 transactions for S1, S2, and S3, each running in its own worker thread. Right?

So suppose S1 and S2 finish and are ready to complete, but S3 is still running. Now, we cannot just commit S1 and S2, as that would leave us in an inconsistent state (in case of a crash, for example). So we should either commit just S1 at this point, or we could wait for S3 and then group-commit all of S1, S2, and S3 together. Right?

Or is there some mechanism by which two different threads could run S2 and S3 in parallel, but still within a single transaction (inside InnoDB e.g.?) - I imagine this is not the case.
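As a concrete illustration of that "commit just S1, or wait for S3 and group-commit everything" choice, here is a minimal sketch with hypothetical names (not MariaDB code): the group commit leader takes, in master order, the longest prefix of original transactions whose statements have all finished applying:

    #include <cstdio>
    #include <map>
    #include <vector>

    // ready[t] = statements of original transaction t that have finished
    // applying; total[t] = statements t contains.  std::map keeps the
    // transactions in master commit order (assuming ids follow that order).
    std::vector<int> transactions_to_group_commit(const std::map<int, int> &ready,
                                                  const std::map<int, int> &total) {
      std::vector<int> out;
      for (const auto &[txn, n_total] : total) {
        auto it = ready.find(txn);
        if (it == ready.end() || it->second < n_total)
          break;                 // incomplete transaction: stop the prefix here
        out.push_back(txn);      // fully applied: safe to include in the group
      }
      return out;
    }

    int main() {
      // S1 (all of T1) and S2 (half of T2) are ready, S3 is still running:
      // only T1 may be committed now, T2 must wait for S3.
      for (int t : transactions_to_group_commit({{1, 1}, {2, 1}}, {{1, 1}, {2, 2}}))
        std::printf("commit T%d\n", t);   // prints: commit T1
    }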
Sorry, previously I omitted one important detail. A worker pulls the tail statement from the shared queue, and not only for itself. When the pulled statement is a dependent one whose parent has been, or is being, processed by another worker, that statement will be redirected to that worker.
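A sketch of that pull-and-redirect rule - hypothetical code, which assumes, purely for illustration, that "dependent" means a later statement of the same original transaction; how to detect data dependencies across transactions at pull time is exactly the question that follows:

    #include <deque>
    #include <map>
    #include <mutex>
    #include <optional>

    struct Stmt { int orig_txn; int seq_in_txn; };

    struct Scheduler {
      std::mutex m;
      std::deque<Stmt> shared_queue;               // events in master order
      std::map<int, int> txn_owner;                // orig_txn -> worker id
      std::map<int, std::deque<Stmt>> redirected;  // worker id -> handed-over work

      // Called by worker `me`; returns a statement for it to execute, if any.
      std::optional<Stmt> pull(int me) {
        std::lock_guard<std::mutex> lk(m);
        if (!redirected[me].empty()) {             // work redirected to me first
          Stmt s = redirected[me].front();
          redirected[me].pop_front();
          return s;
        }
        if (shared_queue.empty())
          return std::nullopt;
        Stmt s = shared_queue.front();
        shared_queue.pop_front();
        auto it = txn_owner.find(s.orig_txn);
        if (it != txn_owner.end() && it->second != me) {
          redirected[it->second].push_back(s);     // parent is owned elsewhere:
          return std::nullopt;                     // hand it over, pull again later
        }
        txn_owner[s.orig_txn] = me;                // claim the transaction
        return s;
      }
    };

    int main() {
      Scheduler sch;
      sch.shared_queue = {{1, 1}, {2, 1}, {2, 2}};  // S1, S2, S3 from the example
      sch.pull(1);   // worker 1 gets S1 and claims T1
      sch.pull(2);   // worker 2 gets S2 and claims T2
      sch.pull(3);   // worker 3 pulls S3 but redirects it to worker 2
      return 0;
    }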
Right - but how will you determine whether there is such a dependency? It seems very hard to do in the general case at pull time (where we haven't even parsed the statement yet)?
With some effort we can make use of the relay log both for recovery and as a binlog for further replication. Above I hinted at the latter. So why wouldn't the slave worker "appropriate" the master binlog images? :-)
As long as the master and slave server versions are the same, this optimization should be feasible.
Aha, I see. This is something like a built-in binlog server. I agree this seems feasible. Surely it is a separate project from this (of course one may benefit from the other).

In a sense, this is the "natural" way to do things. Why go to all the trouble of trying to re-produce the original binlog from executing individual statements, when we already have an (almost) copy directly from the master?

Still, this may not be as easy as that... for example, how do we interleave the multiple binlogs from multi-source replication in a way that is consistent with the actual execution order? And what about extra transactions executed directly on the slave?

Thanks,

 - Kristian.