Kristian Nielsen <knielsen@knielsen-hq.org> writes:

> Axel Schwenke <axel@askmonty.org> writes:
>
>> Benchmark 1 is good old sysbench OLTP. I tested 10.0.7 vs. 10.0.7-pgo.
>> With low concurrency there is about a 10% win from PGO; however, this
>> is completely reversed at higher concurrency by mutex contention (the
>> test was run with the performance schema disabled, so I cannot say
>> which mutex; probably LOCK_open).
>
> Ouch, pgo drops the throughput to 1/2!
>
> That's a pretty serious blow to the whole idea, unless there is not
> just a fix but also a good explanation. I will investigate this, thanks
> a lot for testing!
Ok, so I finally got the time to investigate this. I think I understand
what is going on.

The original problem was that PGO (profile-guided optimisation) showed a
fair improvement at lower concurrency, but a significant reduction in
throughput at higher concurrency, in sysbench OLTP.

It turns out that the real problem is unrelated to PGO. At higher
concurrency, the server code basically falls over, so that adding more
concurrent work significantly decreases throughput. This is a well-known
phenomenon. As a side effect, if we improve the code performance of a
single thread, we effectively increase the concurrency at the critical
spots: threads spend less time executing the real code, hence more time
in the concurrency bottlenecks. The end result is that _any_ change that
improves single-threaded performance causes throughput to decrease at
concurrency levels where the code falls over.

To verify this, I repeated XL's sysbench runs on a number of different
mysqld servers. Apart from XL's original 10.0 and 10.0-pgo, I added a run
with _no_ optimisations (-O0), and some runs where I used PGO but
deliberately decreased performance by putting a dummy pause loop into the
query execution code (a sketch of such a loop is included below). Here
are the results for sysbench read-write; columns are the number of
concurrent sysbench threads.

Transactions per second in sysbench OLTP read-write (higher is better):

                       16-rw    128-rw    256-rw    512-rw
10.0-nopgo           6680.84  13004.87   7850.10   4031.06
10.0-pgo             7249.39  12199.32   6336.47   2614.58
10.0-pgo-pause1000   7040.25  12081.80   5825.99   2464.58
10.0-pgo-pause2000   6774.10  12024.44   5810.60   2433.14
10.0-pgo-pause4000   6469.06  12859.23   6479.85   2589.90
10.0-pgo-pause8000   5779.67  13233.35   7074.85   2741.01
10.0-pgo-pause16000  4710.97  12286.62   7896.23   2889.25
10.0-noopt           4004.37   9613.89   7920.67   3268.46

As we can see, there is a strong correlation between higher throughput at
low concurrency and lower throughput at high concurrency. As we add more
dummy overhead to the PGO server, throughput at high concurrency
increases, and the completely unoptimised -O0 build is faster still.

The sysbench read-only results are similar, though less pronounced, as
the code does not fall over as badly there (I used a recent 10.0 bzr
tree; maybe Svoj's work on LOCK_open has helped solve the problem, or
maybe my compiling out the performance schema made a difference):

Transactions per second in sysbench OLTP read-only (higher is better):

                       16-ro    128-ro    256-ro    512-ro
10.0-nopgo           8903.62  19034.44  18369.42  15933.65
10.0-pgo             9602.81  20057.09  19084.66  13128.61
10.0-pgo-pause1000   9169.94  20403.00  18814.08  12708.24
10.0-pgo-pause2000   8870.11  20307.68  18618.01  13015.76
10.0-pgo-pause4000   8331.52  19903.76  18425.81  13459.38
10.0-pgo-pause8000   7610.22  18897.86  17650.32  13544.74
10.0-pgo-pause16000  6079.60  16654.55  15853.86  14008.75
10.0-noopt           4969.67  12830.43  12263.99  11438.69

Again, at the concurrency levels where PGO is slower than non-PGO, we can
improve throughput by inserting the dummy pause loop.

So the conclusion here is that PGO is actually a viable optimisation. We
should use it for the binaries we release, and if possible integrate it
into the cmake build so that from-source builds will also benefit. The
high-concurrency sysbench results are meaningless as a measure of
single-threaded improvements, since any such improvement ends up
degrading TPS; the real problem needs to be fixed elsewhere, by removing
lock contention and so on.

I will try next to investigate why the code falls over at high
concurrency and see if anything can be done...
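For reference, here is a minimal sketch of the kind of dummy pause loop I
mean for the 10.0-pgo-pauseN builds. This is an illustration rather than
the actual patch: the loop count N matches the build names above, but the
suggested call site in dispatch_command() is only a hypothetical example
of "the query execution code":

  /* Dummy delay loop. The x86 PAUSE instruction burns a few cycles
     without generating shared-memory traffic, so it slows a thread down
     without itself adding contention. */
  #include <immintrin.h>                /* _mm_pause() */

  static inline void dummy_pause_loop(unsigned long loops)
  {
    for (unsigned long i= 0; i < loops; i++)
      _mm_pause();
  }

  /* Hypothetical call site, executed once per statement, e.g. in
     dispatch_command():
       dummy_pause_loop(1000);          // for the 10.0-pgo-pause1000 build
  */

  int main()                            /* trivial standalone check */
  {
    dummy_pause_loop(16000);
    return 0;
  }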
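And to illustrate the "falls over" mechanism in isolation, here is a toy
model, completely unrelated to the server code: each thread does some
work outside a lock, then a short critical section. Shrinking the work
outside the lock, which is what a single-threaded speedup like PGO
amounts to, raises the arrival rate at the mutex; once the lock
saturates, the added contention (cache-line bouncing, futex wakeups) can
make total throughput drop. Whether and how strongly the drop reproduces
depends on the machine, but the shape is the same as in the tables above:

  /* Toy contention model; build with: g++ -O2 -pthread toy.cc */
  #include <atomic>
  #include <chrono>
  #include <cstdio>
  #include <cstdlib>
  #include <mutex>
  #include <thread>
  #include <vector>

  static std::mutex big_lock;              /* stand-in for LOCK_open etc. */
  static std::atomic<bool> stop(false);
  static std::atomic<long> transactions(0);

  static void worker(unsigned work_loops)
  {
    volatile unsigned sink= 0;
    while (!stop.load(std::memory_order_relaxed))
    {
      /* "Query execution" outside the lock; fewer loops = a faster
         (e.g. PGO) build. */
      for (unsigned i= 0; i < work_loops; i++)
        sink+= i;
      {
        std::lock_guard<std::mutex> guard(big_lock);  /* contended part */
        for (unsigned i= 0; i < 200; i++)
          sink+= i;
      }
      transactions.fetch_add(1, std::memory_order_relaxed);
    }
  }

  int main(int argc, char **argv)
  {
    unsigned nthreads= argc > 1 ? atoi(argv[1]) : 256;
    unsigned work=     argc > 2 ? atoi(argv[2]) : 10000;
    std::vector<std::thread> threads;
    for (unsigned i= 0; i < nthreads; i++)
      threads.emplace_back(worker, work);
    std::this_thread::sleep_for(std::chrono::seconds(5));
    stop= true;
    for (size_t i= 0; i < threads.size(); i++)
      threads[i].join();
    printf("%u threads, work=%u: %ld transactions in 5 seconds\n",
           nthreads, work, transactions.load());
    return 0;
  }

Run it at a high thread count, then again with a smaller work parameter,
and compare the transaction counts; the point is the mechanism, not the
exact numbers.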
It also appears that sysbench results at high concurrency are mostly
meaningless for comparing different code versions, unless we can see that
one version falls over and the other does not. Hopefully we can find a
way to eliminate the catastrophic performance drop at high concurrency...

 - Kristian.