[Maria-developers] Status on MDEV-4506, parallel replication
So I have been working for some weeks now on implementing MDEV-4506. This task is about making the slave apply (some) events in parallel threads, to speed up replication and reduce the risk of the slave not being able to keep up with a busy master.

Events are applied in parallel on the slave if they were group-committed together on the master. This is an easy way to detect transactions that are independent. Note that this is transparent to applications; while transactions are executed in parallel on the slave, they are still committed in the same order as on the master.

I also added parallel execution of events with different GTID domain id. This makes testing a lot easier (no need to carefully arrange timing to get a specific group commit on the master), and also really is the whole point of much of my hard work on GTID. So if we have multi-source M1->S1, M2->S1, and S1->S2, then S2 will be able to execute events from M1 in parallel with those from M2, just like S1 can. And the user can explicitly set a different domain_id for e.g. a long-running ALTER or UPDATE, and this way get it to run in parallel and not cause a huge replication delay.

----

On the master, I added to each GTID event a commit_id. If two transactions group-commit together, they are binlogged with the same commit_id; if not, they get different commit_ids. Thus, the slave can detect the possibility of executing two transactions in parallel by checking if the commit_ids are equal.

On the master, I implemented --binlog-commit-wait-count=N and --binlog-commit-wait-usec=T. A transaction will wait at most T microseconds for at least N transactions to queue up and be ready for group commit. This makes it possible to deliberately delay transactions on the master in order to get bigger group commits and thus better opportunity for parallel execution (and again it makes testing easier).

On the slave, I implemented --slave-parallel-threads=N.
If N>0, that many threads will be spawned, and events will be executed in parallel (if possible) by those threads.

----

The current code is pushed here (it is based on 10.0-base):

lp:~maria-captains/maria/10.0-knielsen

It is still far from finished, but it now works sufficiently that I could do some basic benchmarking (just on my laptop, completely unscientifically).

First, I prepared a binlog on the master with plenty of opportunity for parallel replication. I started the master with --binlog-commit-wait-count=20 --binlog-commit-wait-usec=1000000. I then ran this Gypsy script:

-----------------------------------------------------------------------
i|1|DROP TABLE IF EXISTS t1
i|1|CREATE TABLE t1 (a INT PRIMARY KEY, b VARCHAR(100)) ENGINE=InnoDB
p|1|REPLACE INTO t1 (a,b) VALUES (? MOD 10000, ?)|int,varchar /home/knielsen/my/gypsy/words
-----------------------------------------------------------------------

gypsy --queryfile=simple_replace_load.gypsy --duration=20 --threads=40

This results in a binlog with about 65k updates to the table, group-committed in batches of 20 transactions.

I then started a fresh slave and let it replicate everything with START SLAVE UNTIL; the time to replicate all the events is then easy to see in the slave error log.

The time to replicate everything with unmodified 10.0-base was 99 seconds. With --slave-parallel-threads=25, it was just 22 seconds. So that is a 4.5 times speedup, which is quite promising. Also note that at 22 seconds, the slave is within 10% of the speed of the master (the load that generated the binlog ran for 20 seconds).

----

But as I said, there is still significant work left to do. I put a ToDo list at the top of sql/rpl_parallel.cc. Some of the big remaining issues:

1. The existing code is not thread-safe for class Relay_log_info. This class contains a bunch of stuff that is specific to executed transactions, not related to relay-log at all.
This needs to be moved to the new struct rpl_group_info I introduced, and all code updated to pass around a pointer to that struct instead. There may also be a need to add additional locking on Relay_log_info; existing code needs review for this.

2. Error handling needs to be implemented; it is rather more complex in the parallel case. If one transaction fails (and retry also fails), then we need to somehow get hold of all later transactions that are in the process of parallel execution, and abort them + roll them back. Otherwise we get an inconsistent binlog position for the next slave restart.

3. In the old code, when the SQL thread is stopped, it has logic to let the current event group (=transaction) replicate to completion first, with a timeout to force a stop in the middle of the event group if e.g. the master has disappeared. This logic needs to be re-implemented to work when having any number of event groups executing in parallel. (It is important to let the groups complete execution when doing non-transactional stuff that cannot be rolled back, otherwise again the slave position becomes inconsistent for the next slave restart).

So as you see, there is quite a bit of work left on this (as well as on GTID). So I would very much welcome any help on this to avoid causing delays for 10.0-GA ...

 - Kristian.
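
For readers who want to see the commit_id rule described earlier in concrete form, here is a minimal sketch in Python. It is not the code in sql/rpl_parallel.cc; the names GtidEvent, parallel_batches and batches_by_domain are made up for illustration, and the sketch only shows which transactions are candidates for parallel execution. It assumes that, within one GTID domain, consecutive transactions with equal commit_id may run in parallel, and that different domains are independent of each other. It does not model the in-order commit, retries, or error handling.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class GtidEvent:
    """Hypothetical, simplified view of a GTID event as seen by the slave."""
    domain_id: int   # GTID domain the transaction belongs to
    seq_no: int      # position of the transaction within its domain
    commit_id: int   # equal for transactions group-committed together


def parallel_batches(events):
    """Split one domain's stream into batches of consecutive transactions
    that share a commit_id. Each batch is a candidate for parallel apply;
    batches themselves are processed one after another."""
    return [list(group) for _, group in groupby(events, key=lambda e: e.commit_id)]


def batches_by_domain(events):
    """Separate the binlog stream per domain_id (preserving order within
    each domain), then compute the parallel batches for every domain."""
    streams = {}
    for ev in events:
        streams.setdefault(ev.domain_id, []).append(ev)
    return {domain: parallel_batches(evs) for domain, evs in streams.items()}
```

Calling batches_by_domain() on a stream where domain 0 has commit_ids 1, 1, 2 would give one batch of two transactions followed by a batch of one for that domain. Only consecutive transactions are grouped, since the sketch does not assume anything about commit_id values that are separated by other group commits.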