[Maria-developers] Analysing degraded performance at high concurrency in sysbench OLTP
At the Barcelona meeting in January, I promised to take a look at the high-concurrency sysbench OLTP benchmarks, and now I finally had the time to do this. There was a lot of work on LOCK_open by Svoj and Serg. If I have understood correctly, the basic problem was that at high concurrency (like, 512 threads), the TPS is only a small fraction of the peak throughput at lower concurrency. Basically, the server "falls over" and starts thrashing instead of doing real work, due to some kind of inter-processor communication overhead.

I started from Axel's OLTP sysbench runs and scripts, using 10.0 from bzr revno:4151 (revid:svoj@mariadb.org-20140415072957-yeir4jvokyilw5hp). I compiled without performance schema and with PGO, and ran sysbench 0.5 OLTP.

(I just realised that my runs are with 32 tables, while I think the benchmarks in January focused on single-table runs. Maybe I need to re-do my analysis with the single-table benchmark, or perhaps it is too artificial to matter much?).

In the read-only sysbench, the server mostly does not fall over. I guess this is due to the work by Svoj on eliminating LOCK_open? But in read-write, performance drops dramatically at high concurrency. TPS drops to 2600 at 512 threads compared to a peak of around 13000 (numbers here are approximate only, they vary somewhat between different runs).

So I analysed the r/w benchmark with the Linux `perf` tool. It turns out two-thirds of the time is spent in a single kernel function _raw_spin_lock():

- 66.26% mysqld [kernel.kallsyms] [k] _raw_spin_lock

Digging further using --call-graph, this turns out to be mostly futex waits (and futex wakeups) from inside InnoDB locking primitives. Calls like sync_array_get_and_reserve_cell() and sync_array_wait_event() stand out in particular.

So this is related to the non-scalable implementation in InnoDB of locking primitives, which is a known problem. I think Mark Callaghan has written about it a couple of times. Last time I looked at the code, every single mutex wait has to take a global mutex protecting some global arrays and stuff. I even remember seeing code that at mutex release would pthread_cond_broadcast() _every_ waiter, all of them waking up, only to all (except one) go do another wait. This is a killer for scalability.

While investigating, I discovered the variable innodb_sync_array_size, which I did not know about. It seems to split the mutex for some of the synchronisation operations. So I tried to re-run the benchmark with innodb_sync_array_size set to 8 and 64. In both cases, I got significant improvement: TPS increased to 5900, twice the value with innodb_sync_array_size set to the default of 1.

So it is clear that the main limitation in this benchmark was the non-scalable InnoDB synchronisation implementation. After tuning innodb_sync_array_size, time spent in _raw_spin_lock() is down to half what it was before (33% of total time):

+ 33.77% mysqld [kernel.kallsyms] [k] _raw_spin_lock

Investigating the call graphs now shows that the sync_array operations are much less visible. Instead, mutex_create_func(), called from dict_mem_table_create(), is the one that turns up prominently in the profile. I am not familiar with what this part of the InnoDB code is doing, but what I saw from a quick look is that it creates a mutex - and there is another global mutex needed for this, which again limits scalability.

It is a bit surprising to see mutex creation being the most significant bottleneck in the benchmark. I would have assumed that most mutexes could be created up-front and re-used? It is possible that this is a warm-up thing, maybe the code is filling up the buffer pool or some table-cache like thing inside InnoDB? Because I see TPS being rather low for the first 150 seconds of the run (around 3000), and then increasing suddenly to around 8000-9000 for the rest. This might be worth investigating further.

So in summary, my investigations found that the bottleneck in this benchmark, and the likely cause of the fall-over, is a scalability problem with the InnoDB locking primitives. The sync_array part seems to be mitigated to some degree by innodb_sync_array_size; the mutex creation part still needs to be investigated.

I wonder if the InnoDB team @ Oracle is doing something for this in 5.7? Does anyone know? I vaguely recall reading something about it, but I am not sure. It would seem a waste to duplicate their efforts.

In any case, I hope this was useful. As part of this investigation, I installed a new 3.14 kernel on the lizard2 machine and a new `perf` installation, which seems to work well for more detailed investigations of this kind of issue. So let me know if there are other benchmarks that I should look into. One thing that could be interesting is to look for false sharing; there are some performance counters that the Intel manuals describe as usable for this.

As an aside: In my tests, once concurrency becomes high enough that the server falls over, the actual TPS number becomes mostly meaningless. E.g. I saw putting dummy pause loops into the code increase TPS. If TPS stabilises at N% of peak throughput as concurrency goes to infinity, then we can compare N. But if N goes to zero as concurrency goes to infinity, I think it is meaningless to compare actual TPS numbers - we should instead focus on removing the fall-over behaviour. (Maybe this is already obvious to you, I have not followed the previous benchmark efforts that closely).

Hope this helps,

 - Kristian.
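To make the pattern described above concrete, here is a deliberately simplified sketch in C++ (using standard primitives; this is not the actual InnoDB sync-array code) of a lock whose slow path funnels every waiter through one global mutex and whose release broadcasts to all waiters. With N threads blocked on the same lock, every release wakes all N of them, and all but one immediately go back to sleep:

    #include <atomic>
    #include <condition_variable>
    #include <mutex>

    // One global mutex and one global event shared by *all* lock objects --
    // this is the bottleneck the profile points at.
    static std::mutex              global_sync_mutex;
    static std::condition_variable global_sync_event;

    struct SlowMutex {
        std::atomic<bool> locked{false};

        void lock() {
            // Try to take the lock; on failure, park on the single global event.
            while (locked.exchange(true)) {
                std::unique_lock<std::mutex> g(global_sync_mutex);
                global_sync_event.wait(g, [this] { return !locked.load(); });
            }
        }

        void unlock() {
            locked.store(false);
            std::lock_guard<std::mutex> g(global_sync_mutex);
            global_sync_event.notify_all();   // wakes every waiter of every lock
        }
    };

Splitting global_sync_mutex and global_sync_event into several instances is roughly what innodb_sync_array_size does; it spreads the contention, but does not change the wake-all behaviour.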
Hi Kristian, On Tue, Apr 29, 2014 at 12:44:22PM +0200, Kristian Nielsen wrote:
At the Barcelona meeting in January, I promised to take a look at the high-concurrency sysbench OLTP benchmarks, and now I finally had the time to do this.

Thanks for looking at it!
There was a lot of work on LOCK_open by Svoj and Serg. If I have understood correctly, the basic problem was that at high concurrency (like, 512 threads), the TPS is only a small fraction of the peak throughput at lower concurrency. Basically, the server "falls over" and starts thrashing instead of doing real work, due to some kind of inter-processor communication overhead.
There are quite a few issues around scalability. The one that I was attempting to solve was this: MariaDB generates intensive bus traffic when run on different NUMA nodes. I suppose even 2 threads running on different nodes will be affected. It happens due to writes to shared memory locations; mutexes performing spin-locks in particular seem to generate a lot of bus traffic. The subsystems that mostly affect scalability are:
1. THR_LOCK - per-share
2. table cache - now mostly per-share
3. InnoDB
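To illustrate the spinning point (a generic sketch, not MariaDB code): a plain test-and-set spin loop performs an atomic write on every iteration, so the cache line holding the lock word keeps bouncing between cores and NUMA nodes, whereas a test-and-test-and-set loop spins on a plain read and only attempts the write once the lock looks free:

    #include <atomic>
    #include <immintrin.h>   // _mm_pause(), x86 spin hint

    std::atomic<int> lock_word{0};

    void spin_lock_naive() {
        // Every exchange() is a write: the cache line ping-pongs across nodes.
        while (lock_word.exchange(1, std::memory_order_acquire) != 0)
            _mm_pause();
    }

    void spin_lock_ttas() {
        for (;;) {
            // Read-only spin: the line can stay shared locally, no write traffic.
            while (lock_word.load(std::memory_order_relaxed) != 0)
                _mm_pause();
            if (lock_word.exchange(1, std::memory_order_acquire) == 0)
                return;
        }
    }

    void spin_unlock() {
        lock_word.store(0, std::memory_order_release);
    }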
I started from Axel's OLTP sysbench runs and scripts, using 10.0 from bzr revno:4151 (revid:svoj@mariadb.org-20140415072957-yeir4jvokyilw5hp). I compiled without performance schema and with PGO, and ran sysbench 0.5 OLTP.
(I just realised that my runs are with 32 tables, while I think the benchmarks in January focused on single-table runs. Maybe I need to re-do my analysis with the single-table benchmark, or perhaps it is too artificial to matter much?).
Yes, the benchmark was focused on single-table runs. Starting with 10.0.10 we eliminated LOCK_open in favor of a per-share mutex. This means the scalability issues should remain for single-table runs, but should be solved for multi-table runs.
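Roughly, the change means replacing one process-wide lock with a lock per share (the names below are invented for the sketch, not the real server code):

    #include <mutex>

    struct TableShareSketch {
        std::mutex lock;          // per-share mutex instead of global LOCK_open
        unsigned   ref_count = 0;
    };

    void attach_table(TableShareSketch *share) {
        // Threads using *different* tables no longer contend here.
        std::lock_guard<std::mutex> g(share->lock);
        ++share->ref_count;
    }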
In the read-only sysbench, the server mostly does not fall over. I guess this is due to the work by Svoj on eliminating LOCK_open?
Likely. I would gladly interpret benchmark results if there are any. :) Since I didn't analyze InnoDB internals wrt scalability yet, I'd better stay away from commenting on the rest of the e-mail.

Thanks,
Sergey
But in read-write, performance drops dramatically at high concurrency. TPS drops to 2600 at 512 threads compared to a peak of around 13000 (numbers here are approximate only, they vary somewhat between different runs).
So I analysed the r/w benchmark with the Linux `perf` tool. It turns out two-thirds of the time is spent in a single kernel function _raw_spin_lock():
- 66.26% mysqld [kernel.kallsyms] [k] _raw_spin_lock
Digging further using --call-graph, this turns out to be mostly futex waits (and futex wakeups) from inside InnoDB locking primitives. Calls like sync_array_get_and_reserve_cell() and sync_array_wait_event() stand out in particular.
So this is related to the non-scalable implementation in InnoDB of locking primitives, which is a known problem. I think Mark Callaghan has written about it a couple of times. Last time I looked at the code, every single mutex wait has to take a global mutex protecting some global arrays and stuff. I even remember seeing code that at mutex release would pthread_cond_broadcast() _every_ waiter, all of them waking up, only to all (except one) go do another wait. This is a killer for scalability.
While investigating, I discovered the variable innodb_sync_array_size, which I did not know about. It seems to split the mutex for some of the synchronisation operations. So I tried to re-run the benchmark with innodb_sync_array_size set to 8 and 64. In both cases, I got significant improvement: TPS increased to 5900, twice the value with innodb_sync_array_size set to the default of 1.
So it is clear that the main limitation in this benchmark was the non-scalable InnoDB synchronisation implementation. After tuning innodb_sync_array_size, time spent in _raw_spin_lock() is down to half what it was before (33% of total time):
+ 33.77% mysqld [kernel.kallsyms] [k] _raw_spin_lock
Investigating the call graphs now shows that the sync_array operations are much less visible. Instead, mutex_create_func(), called from dict_mem_table_create(), is the one that turns up prominently in the profile. I am not familiar with what this part of the InnoDB code is doing, but what I saw from a quick look is that it creates a mutex - and there is another global mutex needed for this, which again limits scalability.
It is a bit surprising to see mutex creation being the most significant bottleneck in the benchmark. I would have assumed that most mutexes could be created up-front and re-used? It is possible that this is a warm-up thing, maybe the code is filling up the buffer pool or some table-cache like thing inside InnoDB? Because I see TPS being rather low for the first 150 seconds of the run (around 3000), and then increasing suddenly to around 8000-9000 for the rest. This might be worth investigating further.
So in summary, my investigations found that the bottleneck in this benchmark, and the likely cause of the fall-over, is a scalability problem with InnoDB locking primitives. The sync_array part seems to be mitigated to some degree by innodb_sync_array_size, the mutex creation part still needs to be investigated.
I wonder if the InnoDB team @ Oracle is doing something for this in 5.7? Does anyone know? I vaguely recall reading something about it, but I am not sure. It would seem a waste to duplicate their efforts.
In any case, I hope this was useful. As part of this investigation, I installed a new 3.14 kernel on the lizard2 machine and a new `perf` installation, which seems to work well for more detailed investigations of this kind of issue. So let me know if there are other benchmarks that I should look into. One thing that could be interesting is to look for false sharing; there are some performance counters that the Intel manuals describe as usable for this.
As an aside: In my tests, once concurrency becomes high enough that the server falls over, the actual TPS number becomes mostly meaningless. E.g. I saw putting dummy pause loops into the code increase TPS. If TPS stabilises at N% of peak throughput as concurrency goes to infinity, then we can compare N. But if N goes to zero as concurrency goes to infinity, I think it is meaningless to compare actual TPS numbers - we should instead focus on removing the fall-over behaviour.
(Maybe this is already obvious to you, I have not followed the previous benchmark efforts that closely).
Hope this helps,
- Kristian.
Kristian - Did you test InnoDB or XtraDB?
Digging further using --call-graph, this turns out to be mostly futex waits (and futex wakeups) from inside InnoDB locking primitives. Calls like sync_array_get_and_reserve_cell() and sync_array_wait_event() stand out in particular.
Interestingly, I don't recall it being a top issue in our benchmarks (although I was not the one running them, so I could be forgetting some details), and we did test high-concurrency setups. It is possible we worked around it with innodb_sync_array_size and the spinning-related option tuning.
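For reference, such a work-around would look something like this in my.cnf (the values are purely illustrative, not recommendations; innodb_sync_array_size can only be set at server startup):

    [mysqld]
    innodb_sync_array_size = 8     # split the sync array (default 1)
    innodb_sync_spin_loops = 60    # spin rounds before a thread suspends itself
    innodb_spin_wait_delay = 12    # upper bound on the delay between spin probes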
So this is related to the non-scalable implementation in InnoDB of locking primitives, which is a known problem. I think Mark Callaghan has written about it a couple of times. Last time I looked at the code, every single mutex wait has to take a global mutex protecting some global arrays and stuff.
The affected waits are those that go to wait on events in the sync array(s). No global mutex is used if locking is completed through spinning.
I even remember seeing code that at mutex release would pthread_cond_broadcast() _every_ waiter, all of them waking up, only to all (except one) go do another wait. This is a killer for scalability.
We have implemented priority mutexes/rwlocks in XtraDB for a different issue, but they indirectly help here: high-priority waiters are allowed to wait on their own designated event, and when the mutex/rwlock is released, only the high-priority waiters are signalled. There are far fewer high-priority waiter threads than regular ones.
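A minimal sketch of that idea with standard C++ primitives (not the actual XtraDB implementation; the names are invented): priority waiters block on their own condition variable, and on release only that small set is signalled, falling back to the regular waiters when no priority waiter is present:

    #include <condition_variable>
    #include <mutex>

    class PrioLock {
        std::mutex              m_;
        std::condition_variable high_ev_;   // designated event for priority waiters
        std::condition_variable low_ev_;
        bool                    locked_       = false;
        unsigned                high_waiters_ = 0;

    public:
        void lock(bool high_priority) {
            std::unique_lock<std::mutex> g(m_);
            if (high_priority) {
                ++high_waiters_;
                high_ev_.wait(g, [this] { return !locked_; });
                --high_waiters_;
            } else {
                low_ev_.wait(g, [this] { return !locked_; });
            }
            locked_ = true;
        }

        void unlock() {
            std::lock_guard<std::mutex> g(m_);
            locked_ = false;
            if (high_waiters_ > 0)
                high_ev_.notify_one();   // wake only from the small priority set
            else
                low_ev_.notify_one();
        }
    };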
Investigating the call graphs now shows that the sync_array operations are much less visible. Instead, mutex_create_func(), called from dict_mem_table_create(), is the one that turns up prominently in the profile. I am not familiar with what this part of the InnoDB code is doing, but what I saw from a quick look is that it creates a mutex - and there is another global mutex needed for this, which again limits scalability.
It is a bit surprising to see mutex creation being the most significant bottleneck in the benchmark. I would have assumed that most mutexes could be created up-front and re-used? It is possible that this is a warm-up thing, maybe the code is filling up the buffer pool or some table-cache like thing inside InnoDB? Because I see TPS being rather low for the first 150 seconds of the run (around 3000), and then increasing suddenly to around 8000-9000 for the rest. This might be worth investigating further.
dict_mem_table_create() creating mutexes and rwlocks all the time is a known issue: http://bugs.mysql.com/bug.php?id=71708. It was here forever, made worse in Oracle 5.6.16, fully fixed in Percona 5.6.16. Oracle should have a partial fix in 5.6.19 and full in 5.7.
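One generic way to mitigate this kind of cost (just a sketch of the idea, not the actual Percona or Oracle fix, and the names are invented) is to create such per-table synchronisation objects lazily, on first use, so a dict_mem_table_create()-style constructor does not pay for latches that may never be touched:

    #include <mutex>

    struct table_sketch {
        std::once_flag latch_once_;
        std::mutex    *stats_latch_ = nullptr;   // created on first use only

        std::mutex *stats_latch() {
            // Thread-safe one-time creation; the constructor stays cheap.
            std::call_once(latch_once_, [this] { stats_latch_ = new std::mutex; });
            return stats_latch_;
        }
        // (destruction omitted for brevity)
    };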
I wonder if the InnoDB team @ Oracle is doing something for this in 5.7? Does anyone know? I vaguely recall reading something about it, but I am not sure.
5.7 allows different mutex implementations to co-exist, and there is a new implementation that uses futexes. The sync array implementation is still there too. The code pushed so far seems to focus on getting the framework right and adding implementations more than on performance. I'd expect that to change in the later pushes.
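For anyone curious what a futex-based mutex looks like in principle, here is a minimal sketch following Ulrich Drepper's "Futexes Are Tricky" design (this is not the Oracle 5.7 code): the uncontended path is a single compare-and-swap with no system call, and a release wakes at most one waiter:

    #include <atomic>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    class FutexMutex {
        // 0 = unlocked, 1 = locked, 2 = locked with possible waiters
        std::atomic<int> state_{0};

        static long futex(std::atomic<int> *addr, int op, int val) {
            return syscall(SYS_futex, reinterpret_cast<int *>(addr), op, val,
                           nullptr, nullptr, 0);
        }

    public:
        void lock() {
            int c = 0;
            if (state_.compare_exchange_strong(c, 1))
                return;                          // fast path: 0 -> 1, no syscall
            if (c != 2)
                c = state_.exchange(2);          // announce that we will wait
            while (c != 0) {
                futex(&state_, FUTEX_WAIT_PRIVATE, 2);   // sleep while word == 2
                c = state_.exchange(2);
            }
        }

        void unlock() {
            if (state_.fetch_sub(1) != 1) {      // there may be waiters
                state_.store(0);
                futex(&state_, FUTEX_WAKE_PRIVATE, 1);   // wake one, not all
            }
        }
    };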
It would seem a waste to duplicate their efforts.
There are Percona's efforts too ;) -- Laurynas
Laurynas Biveinis <laurynas.biveinis@gmail.com> writes:
Did you test InnoDB or XtraDB?
It should be XtraDB, which is the default in MariaDB 10 now. This is the version info from univ.i:

#define INNODB_VERSION_MAJOR 5
#define INNODB_VERSION_MINOR 6
#define INNODB_VERSION_BUGFIX 15

#ifndef PERCONA_INNODB_VERSION
#define PERCONA_INNODB_VERSION 63.0
#endif
The affected waits are those that go to wait on events in the sync array(s). No global mutex is used if locking is completed through spinning.
We have implemented priority mutexes/rwlocks in XtraDB for a different issue, but they indirectly help here: high-priority waiters are allowed to wait on their own designated event, and when the mutex/rwlock is released, only the high-priority waiters are signalled. There are far fewer high-priority waiter threads than regular ones.
We also have innodb_thread_concurrency. All of these seem to be work-arounds for the fundamental problem that the InnoDB locking primitives are non-scalable. The global mutex on the sync arrays is bad enough, but wake-all is a real killer, as it creates O(N**2) cost for having N threads waiting on the same lock. But of course, this is the view of an outsider. I appreciate that the issue is much more complex once one gets down to the real code. The locking primitives are the very core of a complex legacy codebase. And the InnoDB locking primitives provide a lot of status information to the DBA that would not be available from a simple pthread_mutex_t.
dict_mem_table_create() creating mutexes and rwlocks all the time is a known issue: http://bugs.mysql.com/bug.php?id=71708. It was here forever, made worse in Oracle 5.6.16, fully fixed in Percona 5.6.16. Oracle should have a partial fix in 5.6.19 and full in 5.7.
Ah, nice, thanks for the pointer!
I wonder if the InnoDB team @ Oracle is doing something for this in 5.7? Does anyone know? I vaguely recall reading something about it, but I am not sure.
5.7 allows different mutex implementations to co-exist, and there is a new implementation that uses futexes. The sync array implementation is still there too. The code pushed so far seems to focus on getting the framework right and adding implementations more than on performance. I'd expect that to change in the later pushes.
Ok, so that sounds promising.
It would seem a waste to duplicate their efforts.
There are Percona's efforts too ;)
Indeed! I wouldn't want to duplicate any of that effort, though with Percona's efforts it is a lot easier to avoid, due to better communication. It seems that you great people at Percona have a good handle on the InnoDB issues, together with whatever the Oracle InnoDB team might come up with, so it makes sense for me to focus on other stuff. Though it still makes me cry to look at that sync array code in InnoDB...

Thanks,
 - Kristian.
Laurynas - I am surprised they are still iterating on http://bugs.mysql.com/bug.php?id=71708. Assuming another big change is needed, I think that will be version 4 for that code.

On Wed, Apr 30, 2014 at 5:59 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Laurynas Biveinis <laurynas.biveinis@gmail.com> writes:
Did you test InnoDB or XtraDB?
It should be XtraDB, which is the default in MariaDB 10 now.
This is the version info from univ.i:
#define INNODB_VERSION_MAJOR 5
#define INNODB_VERSION_MINOR 6
#define INNODB_VERSION_BUGFIX 15

#ifndef PERCONA_INNODB_VERSION
#define PERCONA_INNODB_VERSION 63.0
#endif
The affected waits are those that go to wait on events in the sync array(s). No global mutex is used if locking is completed through spinning.
We have implemented priority mutexes/rwlocks in XtraDB for a different issue, but they indirectly help here: high-priority waiters are allowed to wait on their own designated event, and when the mutex/rwlock is released, only the high-priority waiters are signalled. There are far fewer high-priority waiter threads than regular ones.
We also have innodb_thread_concurrency. All of these seem to be work-arounds for the fundamental problem that the InnoDB locking primitives are non-scalable. The global mutex on the sync arrays is bad enough, but wake-all is a real killer, as it creates O(N**2) cost for having N threads waiting on the same lock.
But of course, this is the view of an outsider. I appreciate that the issue is much more complex once one gets down to the real code. The locking primitives are the very core of a complex legacy codebase. And the InnoDB locking primitives provide a lot of status information to the DBA that would not be available from a simple pthread_mutex_t.
dict_mem_table_create() creating mutexes and rwlocks all the time is a known issue: http://bugs.mysql.com/bug.php?id=71708. It was here forever, made worse in Oracle 5.6.16, fully fixed in Percona 5.6.16. Oracle should have a partial fix in 5.6.19 and full in 5.7.
Ah, nice, thanks for the pointer!
I wonder if the InnoDB team @ Oracle is doing something for this in 5.7? Does anyone know? I vaguely recall reading something about it, but I am not sure.
5.7 allows different mutex implementations to co-exist, and there is a new implementation that uses futexes. The sync array implementation is still there too. The code pushed so far seems to focus on getting the framework right and adding implementations more than on performance. I'd expect that to change in the later pushes.
Ok, so that sounds promising.
It would seem a waste to duplicate their efforts.
There are Percona's efforts too ;)
Indeed! I wouldn't want to duplicate any of that effort, though with Percona's efforts it is a lot easier to avoid, due to better communication.
It seems that you great people at Percona have a good handle on the InnoDB issues, together with whatever the Oracle InnoDB team might come up with, so it makes sense for me to focus on other stuff. Though it still makes me cry to look at that sync array code in InnoDB...
Thanks,
- Kristian.
--
Mark Callaghan
mdcallag@gmail.com