Kristian - Did you test InnoDB or XtraDB?
Digging further using --call-graph, this turns out to be mostly futex waits (and futex wakeups) from inside InnoDB locking primitives. Calls like sync_array_get_and_reserve_cell() and sync_array_wait_event() stand out in particular.
Interestingly, I don't recall it being a top issue in our benchmarks (although I was not the one running them, so I could be forgetting some details), and we did test high-concurrency setups. It is possible we worked around it by tuning innodb_sync_array_size and the spinning-related options.
So this is related to the non-scalable implementation in InnoDB of locking primitives, which is a known problem. I think Mark Callaghan has written about it a couple of times. Last time I looked at the code, every single mutex wait has to take a global mutex protecting some global arrays and stuff.
The affected waits are those that go to wait on events in the sync array(s). No global mutex is used if locking is completed through spinning.
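To illustrate the spin-then-wait distinction, here is a minimal sketch in C. It is not InnoDB code: all names (spin_wait_mutex_t, sw_lock, SPIN_ROUNDS, wait_array_mutex and so on) are hypothetical, and a single condition variable stands in for the sync array. The point it tries to show is that an acquisition that succeeds while spinning never touches the shared wait structure or its global mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define SPIN_ROUNDS 30   /* hypothetical spin budget, not an InnoDB default */
#define MAX_THREADS 64

typedef struct {
    atomic_int locked;   /* 0 = free, 1 = held */
    atomic_int waiters;  /* threads currently on the slow path */
} spin_wait_mutex_t;

/* Stand-in for the sync array: one global mutex and one event. */
static pthread_mutex_t wait_array_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wait_array_event = PTHREAD_COND_INITIALIZER;
static atomic_long     slow_path_entries;   /* how often spinning failed */

int sw_try_lock(spin_wait_mutex_t *m) {
    int expected = 0;
    return atomic_compare_exchange_strong(&m->locked, &expected, 1);
}

void sw_lock(spin_wait_mutex_t *m) {
    for (int i = 0; i < SPIN_ROUNDS; i++)
        if (sw_try_lock(m))
            return;                      /* fast path: no global mutex */
    atomic_fetch_add(&slow_path_entries, 1);
    pthread_mutex_lock(&wait_array_mutex);   /* slow path: global mutex */
    atomic_fetch_add(&m->waiters, 1);
    while (!sw_try_lock(m))
        pthread_cond_wait(&wait_array_event, &wait_array_mutex);
    atomic_fetch_sub(&m->waiters, 1);
    pthread_mutex_unlock(&wait_array_mutex);
}

void sw_unlock(spin_wait_mutex_t *m) {
    atomic_store(&m->locked, 0);
    if (atomic_load(&m->waiters) > 0) {      /* only wake if someone sleeps */
        pthread_mutex_lock(&wait_array_mutex);
        pthread_cond_broadcast(&wait_array_event);
        pthread_mutex_unlock(&wait_array_mutex);
    }
}

typedef struct { spin_wait_mutex_t *m; long *counter; int iters; } sw_arg_t;

static void *sw_worker(void *p) {
    sw_arg_t *a = p;
    for (int i = 0; i < a->iters; i++) {
        sw_lock(a->m);
        (*a->counter)++;                 /* protected by the demo mutex */
        sw_unlock(a->m);
    }
    return NULL;
}

/* Runs nthreads workers and returns the final counter value. */
long sw_demo(int nthreads, int iters) {
    spin_wait_mutex_t m;
    pthread_t tids[MAX_THREADS];
    long counter = 0;
    sw_arg_t arg = { &m, &counter, iters };
    int rc;

    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;
    atomic_init(&m.locked, 0);
    atomic_init(&m.waiters, 0);
    for (int i = 0; i < nthreads; i++) {
        rc = pthread_create(&tids[i], NULL, sw_worker, &arg);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    return counter;
}
```

Calling sw_demo(4, 10000) and then reading slow_path_entries gives a rough idea of how often contended acquisitions fell through to the shared wait structure. The number varies from run to run and from machine to machine.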
I even remember seeing code that at mutex release would pthread_cond_broadcast() _every_ waiter, all of them waking up, only to all (except one) go do another wait. This is a killer for scalability.
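The wake-everyone pattern can be reproduced outside InnoDB with a small pthreads sketch. Everything here is hypothetical (herd_t, run_herd, the "token" standing in for the lock becoming free); it only counts how many threads wake up and find nothing to take, with either pthread_cond_broadcast() or pthread_cond_signal() on release.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_WAITERS 64

typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  event;
    int  tokens;           /* how many times the "lock" is up for grabs */
    long acquired;         /* waiters that got a token */
    long futile_wakeups;   /* woke up, found no token, slept again */
} herd_t;

static void *herd_waiter(void *p) {
    herd_t *h = p;
    pthread_mutex_lock(&h->mu);
    while (h->tokens == 0) {
        pthread_cond_wait(&h->event, &h->mu);
        if (h->tokens == 0)
            h->futile_wakeups++;   /* woken for nothing */
    }
    h->tokens--;
    h->acquired++;
    pthread_mutex_unlock(&h->mu);
    return NULL;
}

/* Starts nwaiters threads, releases one token per waiter, and returns
 * how many waiters acquired a token. Futile wakeups go to *futile_out. */
long run_herd(int nwaiters, int use_broadcast, long *futile_out) {
    herd_t h;
    pthread_t tids[MAX_WAITERS];
    int rc;

    if (nwaiters > MAX_WAITERS) nwaiters = MAX_WAITERS;
    pthread_mutex_init(&h.mu, NULL);
    pthread_cond_init(&h.event, NULL);
    h.tokens = 0;
    h.acquired = 0;
    h.futile_wakeups = 0;

    for (int i = 0; i < nwaiters; i++) {
        rc = pthread_create(&tids[i], NULL, herd_waiter, &h);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nwaiters; i++) {     /* one "release" per waiter */
        pthread_mutex_lock(&h.mu);
        h.tokens++;
        if (use_broadcast)
            pthread_cond_broadcast(&h.event);  /* wake every sleeper */
        else
            pthread_cond_signal(&h.event);     /* wake at least one */
        pthread_mutex_unlock(&h.mu);
    }
    for (int i = 0; i < nwaiters; i++)
        pthread_join(tids[i], NULL);

    *futile_out = h.futile_wakeups;
    pthread_cond_destroy(&h.event);
    pthread_mutex_destroy(&h.mu);
    return h.acquired;
}
```

Comparing the futile count from run_herd(n, 1, &futile) with run_herd(n, 0, &futile) should show the broadcast variant doing more wasted wakeups when many waiters are already asleep, though the exact numbers depend on scheduling and are not deterministic.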
We have implemented priority mutexes/rwlocks in XtraDB for a different issue, but it indirectly helps here: high-priority waiters wait on their own designated event. When the mutex/rwlock is released, only the high-priority waiters are signaled. There are far fewer high-priority waiter threads than regular ones.
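A rough sketch of the two-event idea, assuming a plain pthreads lock rather than the actual XtraDB implementation. The names (prio_lock_t, hi_event, lo_event, prio_demo_high_goes_first) are made up for illustration, and the policy for regular waiters (yield while any high-priority waiter is queued) is an assumption of this sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

typedef struct {
    pthread_mutex_t mu;        /* protects the fields below */
    pthread_cond_t  hi_event;  /* high-priority waiters sleep here */
    pthread_cond_t  lo_event;  /* regular waiters sleep here */
    int held, hi_waiters, lo_waiters;
    int order[2], n_order;     /* demo only: who acquired, in order */
} prio_lock_t;

void prio_lock_init(prio_lock_t *l) {
    pthread_mutex_init(&l->mu, NULL);
    pthread_cond_init(&l->hi_event, NULL);
    pthread_cond_init(&l->lo_event, NULL);
    l->held = l->hi_waiters = l->lo_waiters = l->n_order = 0;
}

void prio_lock_acquire(prio_lock_t *l, int high) {
    pthread_mutex_lock(&l->mu);
    if (high) {
        l->hi_waiters++;
        while (l->held)
            pthread_cond_wait(&l->hi_event, &l->mu);
        l->hi_waiters--;
    } else {
        l->lo_waiters++;
        /* Regular waiters also yield to queued high-priority waiters. */
        while (l->held || l->hi_waiters > 0)
            pthread_cond_wait(&l->lo_event, &l->mu);
        l->lo_waiters--;
    }
    l->held = 1;
    pthread_mutex_unlock(&l->mu);
}

void prio_lock_release(prio_lock_t *l) {
    pthread_mutex_lock(&l->mu);
    l->held = 0;
    if (l->hi_waiters > 0)
        pthread_cond_signal(&l->hi_event);    /* regular waiters stay asleep */
    else if (l->lo_waiters > 0)
        pthread_cond_broadcast(&l->lo_event);
    pthread_mutex_unlock(&l->mu);
}

typedef struct { prio_lock_t *l; int high; } prio_arg_t;

static void *prio_worker(void *p) {
    prio_arg_t *a = p;
    prio_lock_acquire(a->l, a->high);
    a->l->order[a->l->n_order++] = a->high;   /* safe: we hold the lock */
    prio_lock_release(a->l);
    return NULL;
}

/* Returns 1 if the high-priority waiter got the lock before the regular one. */
int prio_demo_high_goes_first(void) {
    prio_lock_t l;
    pthread_t hi, lo;
    prio_arg_t hi_arg = { &l, 1 }, lo_arg = { &l, 0 };
    int rc, ready = 0;

    prio_lock_init(&l);
    prio_lock_acquire(&l, 0);                 /* hold the lock up front */
    rc = pthread_create(&lo, NULL, prio_worker, &lo_arg);
    assert(rc == 0);
    rc = pthread_create(&hi, NULL, prio_worker, &hi_arg);
    assert(rc == 0);
    (void)rc;
    while (!ready) {                          /* wait until both are queued */
        pthread_mutex_lock(&l.mu);
        ready = (l.hi_waiters == 1 && l.lo_waiters == 1);
        pthread_mutex_unlock(&l.mu);
        if (!ready) sched_yield();
    }
    prio_lock_release(&l);
    pthread_join(hi, NULL);
    pthread_join(lo, NULL);
    return l.n_order == 2 && l.order[0] == 1 && l.order[1] == 0;
}
```

In this sketch a release with a queued high-priority waiter touches only hi_event, so the (usually much larger) crowd on lo_event is not woken at all until no high-priority waiter remains.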
Now investigating the call graphs shows that the sync_array operations are much less visible. Instead mutex_create_func(), called from dict_mem_table_create(), is the one that turns up prominently in the profile. I am not familiar with what this part of the InnoDB code is doing, but what I saw from a quick look is that it creates a mutex - and there is another global mutex needed for this, which again limits scalability.
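As a sketch of why mutex creation can serialize: assume, as described above, that every newly created mutex is registered in a global structure protected by a global mutex. The types and names below (tracked_mutex_t, tracked_mutex_create, mutex_list_mutex as used here) are illustrative and not copied from the InnoDB source.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef struct tracked_mutex {
    pthread_mutex_t       os_mutex;
    struct tracked_mutex *next;    /* link in the global list */
} tracked_mutex_t;

/* Global bookkeeping shared by every creator thread. */
static pthread_mutex_t  mutex_list_mutex = PTHREAD_MUTEX_INITIALIZER;
static tracked_mutex_t *mutex_list_head  = NULL;
static long             mutex_list_len   = 0;

void tracked_mutex_create(tracked_mutex_t *m) {
    pthread_mutex_init(&m->os_mutex, NULL);

    /* Every creation, from every thread, funnels through this one lock. */
    pthread_mutex_lock(&mutex_list_mutex);
    m->next = mutex_list_head;
    mutex_list_head = m;
    mutex_list_len++;
    pthread_mutex_unlock(&mutex_list_mutex);
}

/* Number of mutexes registered so far. */
long tracked_mutex_count(void) {
    long n;
    pthread_mutex_lock(&mutex_list_mutex);
    n = mutex_list_len;
    pthread_mutex_unlock(&mutex_list_mutex);
    return n;
}
```

If a code path creates several such mutexes per object and that path is hot, the registration lock becomes a contention point even though the mutexes themselves are never contended.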
It is a bit surprising to see mutex creation being the most significant bottleneck in the benchmark. I would have assumed that most mutexes could be created up-front and re-used? It is possible that this is a warm-up thing, maybe the code is filling up the buffer pool or some table-cache like thing inside InnoDB? Because I see TPS being rather low for the first 150 seconds of the run (around 3000), and then increasing suddenly to around 8000-9000 for the rest. This might be worth investigating further.
dict_mem_table_create() creating mutexes and rwlocks all the time is a known issue: http://bugs.mysql.com/bug.php?id=71708. It has been there forever, was made worse in Oracle 5.6.16, and is fully fixed in Percona 5.6.16. Oracle should have a partial fix in 5.6.19 and a full one in 5.7.
I wonder if the InnoDB team @ Oracle is doing something for this in 5.7? Does anyone know? I vaguely recall reading something about it, but I am not sure.
5.7 allows different mutex implementations to co-exist, and there is a new implementation that uses futexes. The sync array implementation is still there too. The code pushed so far seems to focus on getting the framework right and adding implementations more than on performance. I'd expect that to change in the later pushes.
It would seem a waste to duplicate their efforts.
There are Percona's efforts too ;) -- Laurynas