----------------------------------------------------------------------- WORKLOG TASK -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- TASK...........: Parallel replication of group-committed transactions CREATION DATE..: Fri, 18 Mar 2011, 14:06 SUPERVISOR.....: Sergei IMPLEMENTOR....: Knielsen COPIES TO......: CATEGORY.......: Server-RawIdeaBin TASK ID........: 184 (http://askmonty.org/worklog/?tid=184) VERSION........: Server-9.x STATUS.........: Un-Assigned PRIORITY.......: 60 WORKED HOURS...: 0 ESTIMATE.......: 0 (hours remain) ORIG. ESTIMATE.: 0 PROGRESS NOTES: DESCRIPTION: Overview -------- Parallel replication is the idea of executing multiple transactions (or more precisely event groups from the binlog) in parallel in multiple threads on the slave, in order to make the slave applier more scalable to multiple CPU cores and disk spindles. The main problem with parallel replication is identifying transactions that are *independent*, meaning that they can be safely executed in parallel on the slave without dead-locking each other or producing different results than on the master. This worklog describes one way of identifying such independence. MWL#169 describes another. Apart from discovering independent transactions, it is also necessary to implement a mechanism for actually distributing events to different execution threads, coordinating commit order, handling errors and crash recovery, etc. It is expected that different ways of determining independent transactions, such as this worklog and MWL#169, will share the same mechanism for actual parallel execution. It should even be possible to combine different ways to discover independence at the same time. Idea ---- MWL#116 introduces group commit, where multiple transactions are committed together and also written to the binlog together. This works by collecting parallel transactions that are waiting inside COMMIT (or autocommit) for access to the binary log. The idea in this worklog is that all transactions participating in a single group commit must necessarily be independent, and can be safely applied in parallel on a slave. Suppose we are about to write N transactions T_1, ..., T_N into the binlog in group commit. Then all statements for each transaction have completed, but none of them are committed yet. This means that (1) there can be no conflicting row locks between any two T_i and T_j, and (2) no transaction T_i was able to see any changes from T_j (for i<>j and assuming isolation level greater than read uncommitted). Another way to see this is to note that on the master, the order of T_1, ..., T_N is essentially random and depends only on how the kernel decided to schedule the threads during the COMMIT. Any ordering would have been possible just by having different thread scheduling delays when the transactions are put in the ordered list used for group commit; there is no inter-thread dependencies at that point in the code. So any order of the transactions in the binlog would have been possible, and thus any order of slave execution. Therefore, as long as COMMIT cannot change what modifications the transaction already did, any execution order of the transactions on the slave is permissible. Implementation -------------- I suggest that when this feature is enabled, we introduce two new binlog events, group_commit_begin and group_commit_end. We write a group_commit_begin event just before all the events for the transactions in a group commit, and a group_commit_end event just after. We can use the "ignorable events" from MySQL 5.5 for these events; this will allow a slave that does not implement these events to safely ignore them. If needed, we can further increase compatibility by - Having a server option to enable/disable the writing of these two events. - Add a flag that the slave sets to request the sending of these events, and only send them if requested (similar to MWL#47). It still needs to be decided how much of this to do. The slave can then collect all the event groups between the group_commit_begin and group_commit_end pair, and schedule them in parallel. Limitations ----------- The main limitation of this approach is that on the slave, it can only run transactions in parallel if they happened to commit at roughly the same time on the master. This means that there will be little opportunity for parallelism on the slave if 1. The master has low write load 2. The master is mainly running few but long transactions 3. The master has fast fsync() performance for the commit In the case (1), there is in any case less need for scaling the event execution. In the case (2), another method (for example MWL#169) will be needed to achieve efficient parallel replication. For (3), we can improve parallelism by deliberately delaying commits on the master. By inserting a short sleep just before group commit, we allow more independent transactions time to participate, giving more opportunity for parallelism on the slave. At the same time, we reduce fsync() load on the master, potentially freeing more I/O capacity for other work. For example, we might add an option to rate-limit COMMIT fsync() to 25 per seconds or whatever. Care will be needed when using this option, as if the master load has insufficient parallelism, such rate-limitation will reduce throughput on the master. The MWL#163 innodb_release_locks_early option causes transactions to effectively be committed early in COMMIT during the prepare phase. This allows another transaction to see and/or modify rows also modified by the first transaction, and still participate in the same group commit. This means that when innodb_release_locks_early is set, the transactions cannot be safely used for parallel replication (eg. in this case we must disable the writing of group_commit_begin / group_commit_end). Events that modify non-transactional tables or do DDL causes more complications, as they are written directly to the binlog and make changes visible immediately to other transactions (basically they are not transactional). To simplify things, we will not declare any such events independent with respect to any other transactions according to this worklog. There are many useful applications of parallel replications where such events are rare. ESTIMATED WORK TIME ESTIMATED COMPLETION DATE ----------------------------------------------------------------------- WorkLog (v4.0.0)