-----------------------------------------------------------------------
                              WORKLOG TASK
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
TASK...........: Parallel replication using interleaved binlog events to
                 detect potential parallelism
CREATION DATE..: Mon, 21 Mar 2011, 10:45
SUPERVISOR.....:
IMPLEMENTOR....:
COPIES TO......:
CATEGORY.......: Server-RawIdeaBin
TASK ID........: 186 (http://askmonty.org/worklog/?tid=186)
VERSION........: Server-9.x
STATUS.........: Un-Assigned
PRIORITY.......: 60
WORKED HOURS...: 0
ESTIMATE.......: 0 (hours remain)
ORIG. ESTIMATE.: 0

PROGRESS NOTES:

DESCRIPTION:

Similar to MWL#169 and MWL#184, this worklog describes a way to discover
binlog events that can be executed in parallel on the slave, improving
scalability with increasing CPU and I/O resources.

This worklog only deals with parallel replication of row-based events.
Statement-based events should still be executed in serial order.

We also require that commits on the slave still happen in the same order
as on the master (this guarantees that applications can never see a
state on a slave that did not exist on the master, among other nice
properties).

Basic idea
----------

The idea is to work with row-based events. Suppose that on the master we
have two parallel transactions running, T_1 and T_2, and suppose that we
have the following sequence of events:

1. T_2 modifies a row R_1, resulting in row event E_1
2. T_1 commits
3. T_2 commits

In this case, it is safe on a slave to apply the event E_1 of T_2 in
parallel with any events of transaction T_1. Since we know T_1 committed
before T_2, we know that T_1 did not have to wait for T_2 to release any
row lock on R_1 at commit, so T_1 could not have modified R_1 after T_2
did. Similarly, since we know that T_2 executed E_1 before T_1
committed, we know E_1 did not have to wait for T_1 to release any row
locks on R_1, so T_1 could not have modified R_1 before T_2 did.
Thus, we know that T_1 and T_2 are not in conflict on this particular
row R_1 and row event E_1. So on a slave, with this knowledge, we can
start transaction T_2 and execute E_1 (in a separate thread from T_1
events) while T_1 is still running.

Refinement: optimistic apply
----------------------------

Suppose now that instead we have the following sequence of events:

1. T_3 commits
2. T_4 modifies a row R_2, resulting in row event E_2
3. T_4 commits

In this case, we can not be sure that event E_2 is independent of
transaction T_3. It could well be that T_3 also modified R_2, and E_2
was waiting for T_3 to commit and release row locks.

However, on a slave, we can nevertheless try to apply E_2 (in a
separate thread) while T_3 is still running. We then have to deal with
the situation that T_3 gets stuck waiting for T_4 to commit and release
the lock on R_2. In this situation we have a deadlock: T_3 is waiting
for T_4, but T_4 also needs to wait for T_3 (as we want to commit in
the right order, and even more so want the end result to have the
changes of T_4 in row R_2, not those of T_3). So we need to abort and
restart one of the transactions. As long as we always choose to abort
the transaction with the later commit time (T_4 in this case), things
will work out: we will abort T_4, commit T_3, then re-start and commit
T_4.

Using optimistic apply has the potential to increase parallelism on the
slave. However, this must be carefully weighed against the extra cost
of having to roll back and retry some transactions. So it needs
suitable status variables to be able to monitor the number of
transactions and events that were rolled back and re-started (as well
as the number of transactions and events successfully applied), and
user options to control if and to what extent to try optimistic apply.
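The distinction between the two cases above (provably safe versus
optimistic apply) can be decided purely from the position of a row
event relative to commit events in the interleaved binlog. The
following Python sketch illustrates such a classifier; the event-tuple
shape and all names are illustrative assumptions, not actual server
code or binlog format:

```python
def classify_row_events(log):
    """Scan an interleaved binlog, given as a list of entries in log
    order: ("commit", txn) or ("row", txn, event).

    A row event is provably independent of every transaction whose
    commit appears *later* in the log; it may conflict with any
    transaction whose commit appears *earlier*, so applying it in
    parallel with one of those is optimistic.

    Returns a list of (event, [txns requiring optimistic apply]).
    """
    committed = []   # commits seen so far, in log order
    result = []
    for entry in log:
        if entry[0] == "commit":
            committed.append(entry[1])
        else:
            _, txn, event = entry
            result.append((event, [t for t in committed if t != txn]))
    return result

# First example above: E_1 written before T_1's commit -> provably safe.
safe = classify_row_events(
    [("row", "T2", "E1"), ("commit", "T1"), ("commit", "T2")])
# Second example: E_2 written after T_3's commit -> optimistic w.r.t. T_3.
opt = classify_row_events(
    [("commit", "T3"), ("row", "T4", "E2"), ("commit", "T4")])
```

Note how the classification falls out of log order alone: the slave
never needs to inspect which rows an event touches, only where the
event sits relative to other transactions' commits.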
But note that since we always have at least one running transaction
that will not be aborted due to optimistic apply (the one with the
earliest commit), and since all other transactions run in parallel, the
cost of restart may not be too bad, as long as we do not start to
starve the CPU(s). Any extra I/O reads done in the aborted transaction
to bring pages into the buffer pool will in any case benefit us when we
immediately re-start the transaction.

Requirements
------------

Implementing this worklog will require several non-trivial extensions
of the server and storage engine code (these should eventually be filed
as separate worklog items):

1. Ability for the server layer to detect and control what to do in
case of conflicts between two transactions. Currently, the storage
engine (e.g. InnoDB/XtraDB) will just let the later access block,
waiting for the first transaction to commit, and in the case of a
deadlock the storage engine will choose an arbitrary transaction to
roll back. For this worklog, the server layer (replication slave SQL
thread(s)) must be able to (a) cause an abort of the first
(lock-holding) transaction rather than have the second one wait, and
(b) choose which of several possible transactions to abort in case of
deadlock. Note that such a mechanism is also needed for Galera
replication, and the Galera wsrep patch implements something like this
for InnoDB already.

2. Interleaving of events from different transactions in the binlog.
Currently, the master buffers events during a transaction and writes
them all out sequentially during COMMIT; this destroys any knowledge
about which events and transactions ran in parallel on the master. We
need to change this so that events are tagged with a (local)
transaction ID and written out to the binlog as they occur, interleaved
with events from other transactions. We will probably also want to add
"begin transaction" and "prepare" events, to increase the knowledge
available on the slave about the order in which events occurred.
We will still want to do _some_ buffering of events to avoid heavy
contention on writing to the binlog. We can buffer a small number of
normal events; the cost of such buffering is that we may write a row
event E_r _after_ a commit event E_c where it could have been written
before, which loses some opportunity for (non-optimistic) parallel
apply on the slave. We may however _not_ buffer commit events; such
buffering could wrongly put E_r ahead of E_c where it must be after,
causing an unexpected conflict in parallel apply on the slave.

3. A mechanism on the slave for distributing binlog events among
multiple threads and coordinating execution and commit among the
threads. Such a mechanism should ideally be the same as used in MWL#184
and MWL#169.

Discussion
----------

One nice property of this worklog is that the parallelism exposed to
the slave is derived from whatever parallelism was there already on the
master. We know E_1 in T_2 ran in parallel with T_1 on the master, and
use this to similarly run E_1 in parallel with T_1 on the slave. This
increases the likelihood that if parallelism is available on the
master, it will also be available on the slave (note that we will need
parallelism on both master and slave to be able to scale an entire
replication topology).

----

For short OLTP-like transactions, it seems likely that this method will
be particularly suited to be combined with the method from MWL#184,
which allows running in parallel on the slave all transactions that
participate in a single group commit on the master. E.g. for single-row
transactions T_1 (with row event E_1) and T_2 (with row event E_2) that
run in parallel on the master, we can have the following two situations
(among others):

1. E_1 and E_2 execute start      1. E_1 and E_2 execute start
2. E_1 execute finish             2. E_1 execute finish
3. E_2 execute finish             3. T_1 commit
4. T_1 commit                     4. E_2 execute finish
5. T_2 commit                     5. T_2 commit

In the scenario to the left, on a busy master the two commits are
likely to be done as a single group commit, and MWL#184 will allow
parallelism on the slave. In the scenario on the right, optimistic
apply as per this worklog will allow running T_1 and T_2 in parallel on
the slave.

----

Note that by interleaving events from different transactions in the
binlog, we will be writing events before they are committed, and so
will need to issue ROLLBACK statements in case of rollback.

Similarly, on the slave, we can start applying the events from a
transaction even before we have received the commit event from the
master (in fact, even before such a commit event has occurred on the
master). This can potentially reduce master->slave delay.

However, there is a need to be careful, as on the slave we need to keep
the correct commit order, and such order is not known until the commit
events are received. For example, if we had a limited number N of slave
execution threads and they had all started running T_1, ..., T_N, and
we then received a commit event for T_(N+1), we would be deadlocked. It
might be a good idea to only start executing new transactions when we
have at least one pending commit in the relay log, or at least to keep
one spare thread available for execution of one more transaction once
we receive such a commit event.

Similarly, if we get a conflict in optimistic apply of transactions T_1
and T_2, neither of which we have received the commit event for, then
we will need to wait for either commit event to arrive before we can
decide which of T_1 and T_2 to abort.

Note that in any case what we need to do is no worse than what we do in
current MySQL replication. There, we never do any execution until we
have not only seen but also executed any prior commit. Similarly,
whatever buffering of events we need to do on the parallel slave will
be no worse than what we currently have to do on the master (where we
buffer _all_ events until commit).
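The commit-order requirement discussed above can be enforced on the
slave with a small ticket-style gate: workers apply events in parallel
but commit strictly in the master's commit order. The following Python
sketch shows the idea; the class, its names, and the use of a sequence
number per transaction are illustrative assumptions, not the actual
slave implementation:

```python
import threading

class CommitOrderGate:
    """Lets parallel worker threads commit only in master commit order.

    Each transaction gets a sequence number equal to its position in
    the master's commit order; a worker that finishes applying its
    events early blocks until all earlier transactions have committed.
    """
    def __init__(self):
        self._next = 0
        self._cond = threading.Condition()

    def commit(self, seq, do_commit):
        with self._cond:
            while seq != self._next:
                self._cond.wait()      # an earlier transaction is pending
            do_commit()                # engine commit would happen here
            self._next += 1
            self._cond.notify_all()

# Workers may finish applying in an arbitrary order, yet the commit
# order recorded below always comes out as the master's order 0, 1, 2.
gate = CommitOrderGate()
order = []
threads = [threading.Thread(target=gate.commit,
                            args=(seq, lambda s=seq: order.append(s)))
           for seq in (2, 0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

This also makes the thread-exhaustion deadlock mentioned above
concrete: if all N workers are blocked inside commit() waiting for a
transaction that no free thread can apply, nothing can ever advance,
hence the suggestion to keep a spare thread or a pending commit in the
relay log.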
ESTIMATED WORK TIME

ESTIMATED COMPLETION DATE
-----------------------------------------------------------------------
WorkLog (v4.0.0)