Hi Kristian, On Tue, Apr 29, 2014 at 12:44:22PM +0200, Kristian Nielsen wrote:
At the Barcelona meeting in January, I promised to take a look at the high-concurrency sysbench OLTP benchmarks, and now I finally had the time to do this. Thanks for looking at it!
There was a lot of work on LOCK_open by Svoj and Serg. If I have understood correctly, the basic problem was that at high concurrency (like, 512 threads), the TPS is only a small fraction of the peak throughput at lower concurrency. Basically, the server "falls over" and starts thrashing instead of doing real work, due to some kind of inter-processor communication overhead.
There are quite a few issues around scalability. The one I was attempting to solve is this: MariaDB generates intensive bus traffic when run on different NUMA nodes. I suppose even 2 threads running on different nodes will be affected. It happens due to writes to shared memory locations; mutexes performing spin-locks in particular seem to generate a lot of bus traffic. The subsystems that mostly affect scalability are:
1. THR_LOCK - per-share
2. table cache - now mostly per-share
3. InnoDB
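To illustrate what I mean (just a minimal sketch, not server code): with a plain test-and-set spin-lock, every lock attempt writes the lock word, so the cache line holding it has to bounce between the cores (and NUMA nodes) of all the spinning threads. Something like:

/* Minimal sketch (not server code): 8 threads hammering one test-and-set
   spin-lock.  Every lock attempt and every release writes the shared lock
   word, so the cache line holding it bounces between cores and NUMA nodes --
   that is the bus traffic referred to above. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock_word = ATOMIC_FLAG_INIT;   /* one shared cache line */
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        while (atomic_flag_test_and_set_explicit(&lock_word,
                                                 memory_order_acquire))
            ;                                       /* spinning = repeated writes */
        counter++;                                  /* tiny critical section */
        atomic_flag_clear_explicit(&lock_word, memory_order_release);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[8];
    for (int i = 0; i < 8; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < 8; i++)
        pthread_join(threads[i], NULL);
    printf("counter = %ld\n", counter);
    return 0;
}

The real mutex implementations are more elaborate than this, of course, but the shared-write pattern is the same.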
I started from Axel's OLTP sysbench runs and scripts, using 10.0 from bzr revno:4151 (revid:svoj@mariadb.org-20140415072957-yeir4jvokyilw5hp). I compiled without performance schema and with PGO, and ran sysbench 0.5 OLTP.
(I just realised that my runs are with 32 tables, while I think the benchmarks in January focused on single-table runs. Maybe I need to re-do my analysis with the single-table benchmark, or perhaps it is too artificial to matter much?).
Yes, the benchmark was focused on single-table runs. Starting with 10.0.10 we eliminated LOCK_open in favor of a per-share mutex. This means the scalability issues in single-table runs should remain, but the scalability issues in multi-table runs should be solved.
In the read-only sysbench, the server mostly does not fall over. I guess this is due to the work by Svoj on eliminating LOCK_open?
Likely. I would gladly interpret benchmark results if there are any. :) Since I haven't analyzed InnoDB internals wrt scalability yet, I'd better stay away from commenting on the rest of the e-mail. Thanks, Sergey
But in read-write, performance drops dramatically at high concurrency. TPS drops to 2600 at 512 threads compared to a peak of around 13000 (numbers here are approximate only; they vary somewhat between different runs).
So I analysed the r/w benchmark with the Linux `perf` tool. It turns out that two-thirds of the time is spent in a single kernel function, _raw_spin_lock():
- 66.26% mysqld [kernel.kallsyms] [k] _raw_spin_lock
Digging further using --call-graph, this turns out to be mostly futex waits (and futex wakeups) from inside InnoDB locking primitives. Calls like sync_array_get_and_reserve_cell() and sync_array_wait_event() stand out in particular.
So this is related to the non-scalable implementation of locking primitives in InnoDB, which is a known problem. I think Mark Callaghan has written about it a couple of times. Last time I looked at the code, every single mutex wait has to take a global mutex protecting some global arrays. I even remember seeing code that at mutex release would pthread_cond_broadcast() to _every_ waiter, all of them waking up, only for all of them (except one) to go do another wait. This is a killer for scalability.
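To make the problem concrete, here is roughly the pattern as I remember it, heavily simplified and with made-up names (a sketch of the idea, not the actual InnoDB code):

/* Sketch only: the slow path of EVERY lock in the server funnels through one
   global mutex and one global array, and a release wakes all waiters even
   though only one of them can proceed. */
#include <pthread.h>
#include <stdbool.h>

#define SYNC_CELLS 1024

struct wait_cell {
    void *wait_object;
    bool  in_use;
};

static pthread_mutex_t sync_array_mutex = PTHREAD_MUTEX_INITIALIZER;  /* ONE global mutex */
static pthread_cond_t  sync_array_cond  = PTHREAD_COND_INITIALIZER;
static struct wait_cell cells[SYNC_CELLS];

/* Slow path of a lock wait: reserve a cell in the global array, then sleep
   until some release broadcasts.  The caller re-checks its lock and calls
   this again if it still cannot get it. */
static void sync_array_wait(void *object)
{
    pthread_mutex_lock(&sync_array_mutex);        /* contended by every waiter */
    int i;
    for (i = 0; i < SYNC_CELLS; i++)
        if (!cells[i].in_use)
            break;
    if (i == SYNC_CELLS) {                        /* array full: real code handles this */
        pthread_mutex_unlock(&sync_array_mutex);
        return;
    }
    cells[i].wait_object = object;
    cells[i].in_use = true;
    pthread_cond_wait(&sync_array_cond, &sync_array_mutex);
    cells[i].in_use = false;
    pthread_mutex_unlock(&sync_array_mutex);
}

/* Called when ANY lock is released: wake every waiter; all but the one whose
   lock actually became free go straight back to waiting. */
static void sync_array_broadcast(void)
{
    pthread_mutex_lock(&sync_array_mutex);
    pthread_cond_broadcast(&sync_array_cond);     /* thundering herd */
    pthread_mutex_unlock(&sync_array_mutex);
}

With this structure, the global mutex is contended by every lock wait in the server, and every release wakes every waiter regardless of what it is waiting for.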
While investigating, I discovered the variable innodb_sync_array_size, which I did not know about. It seems to split the mutex for some of the synchronisation operations. So I tried to re-run the benchmark with innodb_sync_array_size set to 8 and 64. In both cases, I got a significant improvement: TPS increased to 5900, twice the value seen with innodb_sync_array_size at its default of 1.
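As far as I understand it, innodb_sync_array_size simply partitions this global structure; conceptually something like the following (again just a sketch with invented names):

/* Sketch of how I understand the mitigation: instead of one global sync
   array, keep N independent arrays and pick one by hashing the address of
   the object being waited on.  The real code will differ in the details. */
#include <pthread.h>
#include <stdint.h>

#define N_SYNC_ARRAYS 8            /* corresponds to innodb_sync_array_size=8 */

struct sync_array {
    pthread_mutex_t mutex;         /* now contended by only ~1/N of the waiters */
    pthread_cond_t  cond;
    /* ... wait cells as in the sketch above ... */
};

static struct sync_array sync_arrays[N_SYNC_ARRAYS];   /* initialised at startup */

static struct sync_array *sync_array_for(const void *object)
{
    uintptr_t p = (uintptr_t)object;
    return &sync_arrays[(p >> 6) % N_SYNC_ARRAYS];      /* skip cache-line offset bits */
}

Waiters for different objects then mostly contend on different mutexes, which would explain the improvement I measured.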
So it is clear that the main limitation in this benchmark was the non-scalable InnoDB synchronisation implementation. After tuning innodb_sync_array_size, time spent in _raw_spin_lock() is down to half what it was before (33% of total time):
+ 33.77% mysqld [kernel.kallsyms] [k] _raw_spin_lock
Investigating the call-graphs now shows that the sync_array operations are much less visible. Instead mutex_create_func(), called from dict_mem_table_create(), is the one that turns up prominently in the profile. I am not familiar with what this part of the InnoDB code is doing, but from a quick look I saw that it creates a mutex - and there is another global mutex needed for this, which again limits scalability.
It is a bit surprising to see mutex creation being the most significant bottleneck in the benchmark. I would have assumed that most mutexes could be created up-front and re-used? It is possible that this is a warm-up thing; maybe the code is filling up the buffer pool or some table-cache-like thing inside InnoDB? I say that because I see TPS being rather low for the first 150 seconds of the run (around 3000), and then increasing suddenly to around 8000-9000 for the rest. This might be worth investigating further.
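The kind of re-use I have in mind would be something like keeping freed table objects (and the mutexes inside them) on a free list instead of creating a new mutex every time. Purely hypothetical sketch with invented names - I have not looked at whether this fits the actual dict_mem_table_create() code:

/* Hypothetical sketch, not a patch: keep freed table objects on a free list
   and hand them out again, so the mutex inside each object goes through the
   global "register a new mutex" path only once. */
#include <pthread.h>
#include <stdlib.h>

struct dict_table {                     /* stand-in for dict_mem_table_t */
    pthread_mutex_t mutex;              /* created once, then re-used */
    struct dict_table *next_free;
    /* ... other fields ... */
};

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static struct dict_table *free_list;

struct dict_table *dict_table_get(void)
{
    pthread_mutex_lock(&pool_lock);
    struct dict_table *t = free_list;
    if (t)
        free_list = t->next_free;
    pthread_mutex_unlock(&pool_lock);

    if (!t) {
        t = calloc(1, sizeof(*t));
        if (!t)
            return NULL;
        /* In the real code this is where mutex_create() would register the
           new mutex globally; with re-use that cost is paid only once. */
        pthread_mutex_init(&t->mutex, NULL);
    }
    return t;
}

void dict_table_put(struct dict_table *t)
{
    pthread_mutex_lock(&pool_lock);
    t->next_free = free_list;
    free_list = t;
    pthread_mutex_unlock(&pool_lock);
}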
So in summary, my investigations found that the bottleneck in this benchmark, and the likely cause of the fall-over, is a scalability problem with InnoDB locking primitives. The sync_array part seems to be mitigated to some degree by innodb_sync_array_size, the mutex creation part still needs to be investigated.
I wonder if the InnoDB team @ Oracle is doing something for this in 5.7? Does anyone know? I vaguely recall reading something about it, but I am not sure. It would seem a waste to duplicate their efforts.
In any case, I hope this was useful. As part of this investigation, I installed a new 3.14 kernel on the lizard2 machine and a new `perf` installation, which seems to work well for doing more detailed investigations of this kind of issue. So let me know if there are other benchmarks that I should look into. One thing that could be interesting is to look for false sharing; the Intel manuals describe some performance counters that can be used for this.
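For reference, false sharing is when threads write to different variables that happen to sit in the same cache line; a trivial example (not from the server) of the kind of pattern to look for:

/* Two threads increment two *different* counters, but the counters sit in
   the same cache line, so every increment invalidates the other thread's
   copy of the line. */
#include <pthread.h>
#include <stdint.h>

struct counter {
    long value;
    /* char pad[64 - sizeof(long)];    uncommenting this gives each counter
                                       its own 64-byte cache line and makes
                                       the false sharing disappear */
};

static struct counter counters[2];     /* adjacent -> same cache line */

static void *bump(void *arg)
{
    long idx = (long)(intptr_t)arg;
    for (long i = 0; i < 100000000; i++)
        counters[idx].value++;
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, bump, (void *)(intptr_t)0);
    pthread_create(&t1, NULL, bump, (void *)(intptr_t)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}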
As an aside: In my tests, once concurrency becomes high enough that the server falls over, the actual TPS number becomes mostly meaningless. E.g. I saw that putting dummy pause loops into the code increased TPS. If TPS stabilises at N% of peak throughput as concurrency goes to infinity, then we can compare N. But if N goes to zero as concurrency goes to infinity, I think it is meaningless to compare actual TPS numbers - we should instead focus on removing the fall-over behaviour.
(Maybe this is already obvious to you, I have not followed the previous benchmark efforts that closely).
Hope this helps,
- Kristian.