Michael Widenius <monty@askmonty.org> writes:
On the master, I implemented --binlog-commit-wait-count=N and --binlog-commit-wait-usec=T. A transaction will wait at most T microseconds for at least N transactions to queue up and be ready for group commit. This makes it possible to deliberately delay transactions on the master in order to get bigger group commits, and thus better opportunity for parallel execution on the slave (and again, it makes testing easier).
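For example, to let each transaction wait up to 2 milliseconds for a group of at least 4 transactions to form (illustrative values; the variable names assume the usual mapping of option dashes to underscores):

    SET GLOBAL binlog_commit_wait_count = 4;    -- N: wait for at least 4 transactions
    SET GLOBAL binlog_commit_wait_usec  = 2000; -- T: but wait at most 2000 microseconds

Both variables are dynamic, so they can be tuned on a running master without a restart.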
Do you think the above helps in any real-world case other than testing?
Assuming we have N=4 and a wait time T of 2 milliseconds:
- We have 3 threads ready to commit.
- We wait 2 milliseconds and get 2 more threads.
- We now commit 5 threads. During this time, 3 other threads arrive waiting to commit.
- We wait again...
In a scenario without waiting:
- We have 3 threads ready to commit.
- We commit those 3 threads. During this time, 5 more threads arrive waiting to commit.
- We commit 5 threads.
In other words, as the group commit will in any case take time (around 50 milliseconds on a hard disk), and thus automatically groups things together for the next commit, why do we ever need to wait more?
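To put rough numbers on it: if one sync takes 50 milliseconds, the disk can do about 1000/50 = 20 group commits per second; with, say, 100 transactions per second arriving, each group would naturally collect around 100/20 = 5 transactions without any added delay.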
I agree that with --binlog-commit-wait-count you may get fewer sync calls, but at the expense of many threads taking up to T microseconds longer to execute.
The worst-case scenario is when you have only one user doing a lot of inserts with auto-commit. In this case, waiting slows down the server by T microseconds for every query.
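To make that concrete: with T = 100000 (0.1 seconds, purely an illustrative value), a single connection running 1000 auto-commit inserts would spend roughly 1000 x 0.1 = 100 extra seconds just waiting, since no other transaction ever arrives to complete the group.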
Have you been able to run any kind of benchmark where using --binlog-commit-wait-count gives better performance?
I agree that the options are good for testing. The main question I have is whether we want these variables in a production server, and how we should document when and how a user should use them.
I agree that, with respect to group commit, --binlog-commit-wait-count in many cases will not improve performance (that is why I did not implement it earlier). However, for parallel replication things are a bit different.

Suppose that the application is doing C transaction commits per second, and that the disk system is capable of F binlog fsyncs per second. If C is significantly bigger than F, then things are as you describe: several transactions will generally queue up while the previous group commit runs, and there will be sufficient parallelism without using --binlog-commit-wait-*. This is typical for e.g. a simple harddisk-based system (F = 40 commits/second, perhaps) where all data is cached in the InnoDB buffer pool (e.g. C > 500 transactions/second).

On the other hand, if C is smaller than F (or of similar magnitude), then usually only a few or no new transactions will have had time to queue up while the previous transaction is committing. So there will not be much parallelism for parallel replication to exploit without using --binlog-commit-wait-*. This is typical for e.g. a good-quality server with a battery-backed RAID controller (F > 1000 commits/second) where the data is too big to fit in the buffer pool and every update requires disk access to complete (C < 500 transactions/second, for example).

And in fact it is the second case, where random I/O is the bottleneck and multiple disk spindles are needed to improve I/O throughput, that a single-threaded slave hurts the most, and where increasing --binlog-commit-wait-usec can be used with the least penalty.

I agree we need to document clearly the risk that --binlog-commit-wait-* will decrease performance on the master. Basically, the user can look at the ratio of Binlog_commits to Binlog_group_commits to check whether enough transactions are part of each group commit for parallel replication to be effective.
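For example, something like this (an illustrative session; the SHOW pattern matches the two status variables above, and the second query assumes information_schema.GLOBAL_STATUS is available):

    SHOW GLOBAL STATUS LIKE 'Binlog_%commits';

    -- Average number of transactions per group commit;
    -- values close to 1 mean little parallelism to exploit:
    SELECT bc.VARIABLE_VALUE / bgc.VARIABLE_VALUE AS avg_group_size
      FROM information_schema.GLOBAL_STATUS bc
      JOIN information_schema.GLOBAL_STATUS bgc
     WHERE bc.VARIABLE_NAME = 'BINLOG_COMMITS'
       AND bgc.VARIABLE_NAME = 'BINLOG_GROUP_COMMITS';

If avg_group_size stays near 1 even under load, that is the case where increasing --binlog-commit-wait-* is what creates groups for the slave to execute in parallel.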
1. The existing code is not thread-safe for class Relay_log_info. This class contains a bunch of state that is specific to the transactions being executed and not related to the relay log at all. That state needs to be moved into the new struct rpl_group_info I introduced, and all code updated to pass around a pointer to that struct instead. There may also be a need for additional locking on Relay_log_info; the existing code needs review for this.
I could take a look at working on the above tomorrow and next week.
Ok, great. Ping me when you can so we can coordinate; I may have some partial patches for this lying around. - Kristian.