Hi Kristian, On Tue, Apr 29, 2014 at 12:44:22PM +0200, Kristian Nielsen wrote:
At the Barcelona meeting in January, I promised to take a look at the high-concurrency sysbench OLTP benchmarks, and now I finally had the time to do this. Thanks for looking at it!
There was a lot of work on LOCK_open by Svoj and Serg. If I have understood correctly, the basic problem was that at high concurrency (like, 512 threads), the TPS is only a small fraction of the peak throughput at lower concurrency. Basically, the server "falls over" and starts thrashing instead of doing real work, due to some kind of inter-processor communication overhead.
There are quite a few issues around scalability. The one I was attempting to solve is this: MariaDB generates intensive bus traffic when run on different NUMA nodes. I suppose even 2 threads running on different nodes will be affected. It happens due to writes to shared memory locations; mutexes performing spin-locks in particular seem to generate a lot of bus traffic. The subsystems that mostly affect scalability are:
1. THR_LOCK - per-share
2. table cache - now mostly per-share
3. InnoDB
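To illustrate what I mean (just a minimal sketch, not server code): with a plain test-and-set spin-lock, every lock attempt writes the lock word, so the cache line holding it has to bounce between the cores (and NUMA nodes) of all the spinning threads. Something like:

/* Minimal sketch (not server code): 8 threads hammering one test-and-set
   spin-lock.  Every lock attempt and every release writes the shared lock
   word, so the cache line holding it bounces between cores and NUMA nodes --
   that is the bus traffic referred to above. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock_word = ATOMIC_FLAG_INIT;   /* one shared cache line */
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        while (atomic_flag_test_and_set_explicit(&lock_word,
                                                 memory_order_acquire))
            ;                                       /* spinning = repeated writes */
        counter++;                                  /* tiny critical section */
        atomic_flag_clear_explicit(&lock_word, memory_order_release);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[8];
    for (int i = 0; i < 8; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < 8; i++)
        pthread_join(threads[i], NULL);
    printf("counter = %ld\n", counter);
    return 0;
}

The real mutex implementations are more elaborate than this, of course, but the shared-write pattern is the same.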
I started from Axel's OLTP sysbench runs and scripts, using 10.0 from bzr revno:4151 (revid:svoj@mariadb.org-20140415072957-yeir4jvokyilw5hp). I compiled without performance schema and with PGO, and ran sysbench 0.5 OLTP.
(I just realised that my runs are with 32 tables, while I think the benchmarks in January focused on single-table runs. Maybe I need to re-do my analysis with the single-table benchmark, or perhaps it is too artificial to matter much?).
Yes, the benchmark was focused on single-table runs. Starting with 10.0.10 we eliminated LOCK_open in favor of a per-share mutex. This means the scalability issues in single-table runs should remain, but the scalability issues in multi-table runs should be solved.
In the read-only sysbench, the server mostly does not fall over. I guess this is due to the work by Svoj on eliminating LOCK_open?
Likely. I would gladly interpret benchmark results if there are any. :) Since I haven't analyzed InnoDB internals wrt scalability yet, I'd better stay away from commenting on the rest of the e-mail. Thanks, Sergey
But in read-write, performance drops dramatically at high concurrency. TPS drops to 2600 at 512 threads compared to a peak of around 13000 (numbers here are approximate only; they vary somewhat between different runs).
So I analysed the r/w benchmark with the Linux `perf` tool. It turns out that two-thirds of the time is spent in a single kernel function, _raw_spin_lock():
- 66.26% mysqld [kernel.kallsyms] [k] _raw_spin_lock
Digging further using --call-graph, this turns out to be mostly futex waits (and futex wakeups) from inside InnoDB locking primitives. Calls like sync_array_get_and_reserve_cell() and sync_array_wait_event() stand out in particular.
So this is related to the non-scalable implementation of locking primitives in InnoDB, which is a known problem. I think Mark Callaghan has written about it a couple of times. Last time I looked at the code, every single mutex wait has to take a global mutex protecting some global arrays. I even remember seeing code that at mutex release would pthread_cond_broadcast() to _every_ waiter, all of them waking up, only for all of them (except one) to go do another wait. This is a killer for scalability.
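To make the problem concrete, here is roughly the pattern as I remember it, heavily simplified and with made-up names (a sketch of the idea, not the actual InnoDB code):

/* Sketch only: the slow path of EVERY lock in the server funnels through one
   global mutex and one global array, and a release wakes all waiters even
   though only one of them can proceed. */
#include <pthread.h>
#include <stdbool.h>

#define SYNC_CELLS 1024

struct wait_cell {
    void *wait_object;
    bool  in_use;
};

static pthread_mutex_t sync_array_mutex = PTHREAD_MUTEX_INITIALIZER;  /* ONE global mutex */
static pthread_cond_t  sync_array_cond  = PTHREAD_COND_INITIALIZER;
static struct wait_cell cells[SYNC_CELLS];

/* Slow path of a lock wait: reserve a cell in the global array, then sleep
   until some release broadcasts.  The caller re-checks its lock and calls
   this again if it still cannot get it. */
static void sync_array_wait(void *object)
{
    pthread_mutex_lock(&sync_array_mutex);        /* contended by every waiter */
    int i;
    for (i = 0; i < SYNC_CELLS; i++)
        if (!cells[i].in_use)
            break;
    if (i == SYNC_CELLS) {                        /* array full: real code handles this */
        pthread_mutex_unlock(&sync_array_mutex);
        return;
    }
    cells[i].wait_object = object;
    cells[i].in_use = true;
    pthread_cond_wait(&sync_array_cond, &sync_array_mutex);
    cells[i].in_use = false;
    pthread_mutex_unlock(&sync_array_mutex);
}

/* Called when ANY lock is released: wake every waiter; all but the one whose
   lock actually became free go straight back to waiting. */
static void sync_array_broadcast(void)
{
    pthread_mutex_lock(&sync_array_mutex);
    pthread_cond_broadcast(&sync_array_cond);     /* thundering herd */
    pthread_mutex_unlock(&sync_array_mutex);
}

With this structure, the global mutex is contended by every lock wait in the server, and every release wakes every waiter regardless of what it is waiting for.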
While investigating, I discovered the variable innodb_sync_array_size, which I did not know about. It seems to split the mutex for some of the synchronisation operations. So I tried to re-run the benchmark with innodb_sync_array_size set to 8 and 64. In both cases, I got a significant improvement: TPS increased to 5900, twice the value seen with innodb_sync_array_size at its default of 1.
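As far as I understand it, innodb_sync_array_size simply partitions this global structure; conceptually something like the following (again just a sketch with invented names):

/* Sketch of how I understand the mitigation: instead of one global sync
   array, keep N independent arrays and pick one by hashing the address of
   the object being waited on.  The real code will differ in the details. */
#include <pthread.h>
#include <stdint.h>

#define N_SYNC_ARRAYS 8            /* corresponds to innodb_sync_array_size=8 */

struct sync_array {
    pthread_mutex_t mutex;         /* now contended by only ~1/N of the waiters */
    pthread_cond_t  cond;
    /* ... wait cells as in the sketch above ... */
};

static struct sync_array sync_arrays[N_SYNC_ARRAYS];   /* initialised at startup */

static struct sync_array *sync_array_for(const void *object)
{
    uintptr_t p = (uintptr_t)object;
    return &sync_arrays[(p >> 6) % N_SYNC_ARRAYS];      /* skip cache-line offset bits */
}

Waiters for different objects then mostly contend on different mutexes, which would explain the improvement I measured.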
So it is clear that the main limitation in this benchmark was the non-scalable InnoDB synchronisation implementation. After tuning innodb_sync_array_size, time spent in _raw_spin_lock() is down to half what it was before (33% of total time):
+ 33.77% mysqld [kernel.kallsyms] [k] _raw_spin_lock
Investigating the call-graphs now shows that the sync_array operations are much less visible. Instead mutex_create_func(), called from dict_mem_table_create(), is the one that turns up prominently in the profile. I am not familiar with what this part of the InnoDB code is doing, but from a quick look I saw that it creates a mutex - and there is another global mutex needed for this, which again limits scalability.
It is a bit surprising to see mutex creation being the most significant bottleneck in the benchmark. I would have assumed that most mutexes could be created up-front and re-used? It is possible that this is a warm-up thing; maybe the code is filling up the buffer pool or some table-cache-like thing inside InnoDB? I say that because I see TPS being rather low for the first 150 seconds of the run (around 3000), and then increasing suddenly to around 8000-9000 for the rest. This might be worth investigating further.
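The kind of re-use I have in mind would be something like keeping freed table objects (and the mutexes inside them) on a free list instead of creating a new mutex every time. Purely hypothetical sketch with invented names - I have not looked at whether this fits the actual dict_mem_table_create() code:

/* Hypothetical sketch, not a patch: keep freed table objects on a free list
   and hand them out again, so the mutex inside each object goes through the
   global "register a new mutex" path only once. */
#include <pthread.h>
#include <stdlib.h>

struct dict_table {                     /* stand-in for dict_mem_table_t */
    pthread_mutex_t mutex;              /* created once, then re-used */
    struct dict_table *next_free;
    /* ... other fields ... */
};

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static struct dict_table *free_list;

struct dict_table *dict_table_get(void)
{
    pthread_mutex_lock(&pool_lock);
    struct dict_table *t = free_list;
    if (t)
        free_list = t->next_free;
    pthread_mutex_unlock(&pool_lock);

    if (!t) {
        t = calloc(1, sizeof(*t));
        if (!t)
            return NULL;
        /* In the real code this is where mutex_create() would register the
           new mutex globally; with re-use that cost is paid only once. */
        pthread_mutex_init(&t->mutex, NULL);
    }
    return t;
}

void dict_table_put(struct dict_table *t)
{
    pthread_mutex_lock(&pool_lock);
    t->next_free = free_list;
    free_list = t;
    pthread_mutex_unlock(&pool_lock);
}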
So in summary, my investigations found that the bottleneck in this benchmark, and the likely cause of the fall-over, is a scalability problem with InnoDB locking primitives. The sync_array part seems to be mitigated to some degree by innodb_sync_array_size, the mutex creation part still needs to be investigated.
I wonder if the InnoDB team @ Oracle is doing something for this in 5.7? Does anyone know? I vaguely recall reading something about it, but I am not sure. It would seem a waste to duplicate their efforts.
In any case, I hope this was useful. As part of this investigation, I installed a new 3.14 kernel on the lizard2 machine and a new `perf` installation, which seems to work well for doing more detailed investigations of this kind of issue. So let me know if there are other benchmarks that I should look into. One thing that could be interesting is to look for false sharing; the Intel manuals describe some performance counters that can be used for this.
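For reference, false sharing is when threads write to different variables that happen to sit in the same cache line; a trivial example (not from the server) of the kind of pattern to look for:

/* Two threads increment two *different* counters, but the counters sit in
   the same cache line, so every increment invalidates the other thread's
   copy of the line. */
#include <pthread.h>
#include <stdint.h>

struct counter {
    long value;
    /* char pad[64 - sizeof(long)];    uncommenting this gives each counter
                                       its own 64-byte cache line and makes
                                       the false sharing disappear */
};

static struct counter counters[2];     /* adjacent -> same cache line */

static void *bump(void *arg)
{
    long idx = (long)(intptr_t)arg;
    for (long i = 0; i < 100000000; i++)
        counters[idx].value++;
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, bump, (void *)(intptr_t)0);
    pthread_create(&t1, NULL, bump, (void *)(intptr_t)1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}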
As an aside: In my tests, once concurrency becomes high enough that the server falls over, the actual TPS number becomes mostly meaningless. E.g. I saw that putting dummy pause loops into the code increased TPS. If TPS stabilises at N% of peak throughput as concurrency goes to infinity, then we can compare N. But if N goes to zero as concurrency goes to infinity, I think it is meaningless to compare actual TPS numbers - we should instead focus on removing the fall-over behaviour.
(Maybe this is already obvious to you, I have not followed the previous benchmark efforts that closely).
Hope this helps,
- Kristian.