Re: [Maria-developers] MDEV-4956 - Reduce usage of LOCK_open: TABLE_SHARE::tdc.used_tables

16 Sep 2013

Hi Sergei,

On Sat, Sep 14, 2013 at 04:44:28PM +0200, Sergei Golubchik wrote:
...
Hi, Sergey!
On Sep 13, Sergey Vojtovich wrote:
...
Hi Sergei,
comments inline and a question: 10.0 throughput is half that of 5.6
in a specific case. It is known to be caused by tc_acquire_table() and
tc_release_table(). Do we want to fix it? If yes, how?
How is it caused by tc_acquire_table/tc_release_table?
Threads spend a lot of time waiting for LOCK_open in these functions,
because the code protected by LOCK_open takes a long time to execute.
...
In what specific case?
The case is: many threads access one table (read-only OLTP).
...
...
...
...
...
Why are per-share lists updated under the global mutex?
Alas, it doesn't solve the CPU cache coherence problem.
It doesn't solve the CPU cache coherence problem, yes.
And it doesn't help if you have only one hot table.
But it certainly helps if many threads access many tables.
Ok, let's agree to agree: it will help in certain cases. Most probably it
won't improve the situation much if all threads access a single table.
Of course.
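
To make the per-share locking idea above concrete, here is a minimal sketch. It assumes simplified stand-in types (Table, Share) and std::mutex; none of this is the actual MariaDB table cache code, and the function names are only illustrative.

```cpp
#include <cassert>
#include <list>
#include <mutex>

// Hypothetical stand-ins for TABLE and TABLE_SHARE; not the MariaDB definitions.
struct Table { int id; };

struct Share {
  std::mutex lock;                // plays the role of a per-share mutex
  std::list<Table*> free_tables;  // cached instances not currently in use
  std::list<Table*> used_tables;  // instances handed out to threads
};

// Take a cached instance while holding only this share's mutex.
// Returns nullptr when the share has no free instance.
Table* acquire_table(Share& share) {
  std::lock_guard<std::mutex> guard(share.lock);
  if (share.free_tables.empty()) return nullptr;
  Table* t = share.free_tables.front();
  share.free_tables.pop_front();
  share.used_tables.push_front(t);
  return t;
}

// Move an instance back from the used list to the free list.
void release_table(Share& share, Table* t) {
  assert(t != nullptr);
  std::lock_guard<std::mutex> guard(share.lock);
  share.used_tables.remove(t);
  share.free_tables.push_front(t);
}
```

Threads working on different Share objects never touch the same mutex in this sketch. Threads working on one hot share still serialize on its lock, and the list nodes still move between CPU caches, which matches the caveats above.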
...
We could try to ensure that the per-share mutex is on the same cache line as
the free_tables and used_tables list heads. In this case I guess
mysql_mutex_lock(&share->tdc.LOCK_table_share) will load the list heads into
CPU cache along with the mutex structure. OTOH we still have to read per-TABLE
prev/next pointers. And in 5.6 the per-partition mutex should jump out of
CPU cache less frequently than our per-share mutex. Worth trying?
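
A rough sketch of the layout idea above, assuming a 64-byte cache line and using std::mutex in place of mysql_mutex_t. The struct name ShareTdc and both helper functions are hypothetical, not MariaDB code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

constexpr std::size_t kCacheLine = 64;  // assumed line size; typical on x86-64

struct Table;  // hypothetical stand-in for TABLE

// Hypothetical layout: both list heads first, then the mutex, with the
// whole object starting on a cache-line boundary.
struct alignas(kCacheLine) ShareTdc {
  Table* free_tables_head = nullptr;
  Table* used_tables_head = nullptr;
  std::mutex lock;
};

// Index of the cache line (relative to the start of the object) that holds
// the first byte of the given member.
inline std::size_t line_of(const ShareTdc& s, const void* member) {
  auto base = reinterpret_cast<std::uintptr_t>(&s);
  auto addr = reinterpret_cast<std::uintptr_t>(member);
  assert(addr >= base && addr < base + sizeof(ShareTdc));
  return (addr - base) / kCacheLine;
}

// True when both list heads start on the same line as the mutex.
inline bool heads_share_line_with_lock(const ShareTdc& s) {
  return line_of(s, &s.free_tables_head) == line_of(s, &s.lock) &&
         line_of(s, &s.used_tables_head) == line_of(s, &s.lock);
}
```

The check only looks at where each member starts. A large mutex type can still spill into the next line, and the per-TABLE prev/next pointers live elsewhere, so this covers only the list heads.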
Did you benchmark that these cache misses are a problem?
What is the main problem that impacts the performance?
We (Axel and I) did a lot of different benchmarks before we concluded
cache misses to be the main problem. Please let me know if you're interested
in specific results: we can either find them in the benchmark archives or run
the benchmark again.

One interesting result I just found is as follows:
10.0.4, read-only OLTP, 64 threads, tps ~10000
+---------------------------------------------+------------+-----------------+
| event_name                                  | count_star | sum_timer_wait  |
+---------------------------------------------+------------+-----------------+
| wait/synch/mutex/sql/LOCK_open              |    2784632 | 161835901661916 |
| wait/synch/mutex/mysys/THR_LOCK::mutex      |    2784556 |  28804019775192 |
...skip...

Note that LOCK_open and THR_LOCK::mutex are contended about equally often
(nearly the same count_star), but their total wait time differs ~6x.

Removing used_tables from tc_acquire_table/tc_release_table makes sum_timer_wait
go down from 161s to 100s.
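
For reference, the arithmetic behind these figures can be sketched as below. Performance Schema timer columns are in picoseconds; the struct and function names are made up for illustration, and the two rows are copied from the table above.

```cpp
#include <cassert>
#include <cstdint>

// Performance Schema reports timer values in picoseconds.
constexpr double kPicosPerSecond = 1e12;

struct WaitRow {
  std::uint64_t count_star;      // number of instrumented waits
  std::uint64_t sum_timer_wait;  // total wait time, picoseconds
};

// Values copied from the table above.
constexpr WaitRow kLockOpen{2784632ULL, 161835901661916ULL};
constexpr WaitRow kThrLock{2784556ULL, 28804019775192ULL};

// Total wait time in seconds.
inline double total_seconds(const WaitRow& r) {
  return static_cast<double>(r.sum_timer_wait) / kPicosPerSecond;
}

// Average time per wait in microseconds.
inline double avg_wait_micros(const WaitRow& r) {
  assert(r.count_star > 0);
  return static_cast<double>(r.sum_timer_wait) /
         static_cast<double>(r.count_star) / 1e6;
}
```

With these inputs the totals come out at roughly 161.8 s for LOCK_open and 28.8 s for THR_LOCK::mutex, or about 58 us versus 10 us per wait, a ratio of about 5.6.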

Regards,
Sergey