Hi Sergei,

On Sat, Sep 14, 2013 at 04:44:28PM +0200, Sergei Golubchik wrote:
Hi, Sergey!
On Sep 13, Sergey Vojtovich wrote:
Hi Sergei,
comments inline and a question: 10.0 throughput is half that of 5.6 in a specific case. It is known to be caused by tc_acquire_table() and tc_release_table(). Do we want to fix it? If yes, how?
How is it caused by tc_acquire_table/tc_release_table?

Threads spend a lot of time waiting for LOCK_open in these functions, because the code protected by LOCK_open takes a long time to execute.
In what specific case?

The case is: many threads access one table (read-only OLTP).
Why are per-share lists updated under the global mutex?

Alas, it doesn't solve the CPU cache coherence problem.

It doesn't solve the CPU cache coherence problem, yes. And it doesn't help if you have only one hot table. But it certainly helps if many threads access many tables.

Ok, let's agree to agree: it will help in certain cases. Most probably it won't improve the situation much if all threads access a single table.
Of course.
We could try to ensure that the per-share mutex is on the same cache line as the free_tables and used_tables list heads. In this case I guess mysql_mutex_lock(&share->tdc.LOCK_table_share) will load the list heads into CPU cache along with the mutex structure. OTOH we still have to read the per-TABLE prev/next pointers. And in 5.6 the per-partition mutex should jump out of CPU cache less frequently than our per-share mutex. Worth trying?
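
A minimal sketch of the cache-line idea above, not the actual MariaDB code. Table and ShareTdc are hypothetical stand-ins for TABLE and share->tdc, std::mutex stands in for mysql_mutex_t, and the 64-byte cache line is an assumption about the target CPU. Whether the mutex and both list heads really fit in one line depends on the platform's mutex size, which is why the sketch checks it at run time instead of asserting it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines

// Hypothetical stand-in for TABLE: only the list linkage is modelled.
struct Table {
    Table* next = nullptr;
    Table* prev = nullptr;
};

// Hypothetical stand-in for share->tdc. The struct starts on a cache-line
// boundary, so the mutex and the list heads are adjacent within it.
struct alignas(kCacheLine) ShareTdc {
    std::mutex lock;               // plays the role of LOCK_table_share
    Table* free_tables = nullptr;  // head of the free list
    Table* used_tables = nullptr;  // head of the used list
};

// True if the mutex and both list heads occupy a single cache line.
inline bool heads_share_line_with_mutex(const ShareTdc& s) {
    const auto base = reinterpret_cast<std::uintptr_t>(&s);
    const auto end = reinterpret_cast<std::uintptr_t>(&s.used_tables) +
                     sizeof(s.used_tables);
    return base % kCacheLine == 0 && end - base <= kCacheLine;
}

inline void push_front(Table*& head, Table* t) {
    t->prev = nullptr;
    t->next = head;
    if (head) head->prev = t;
    head = t;
}

inline void list_remove(Table*& head, Table* t) {
    if (t->prev) t->prev->next = t->next; else head = t->next;
    if (t->next) t->next->prev = t->prev;
    t->next = t->prev = nullptr;
}

// Move one table from the free list to the used list under the per-share
// lock. The list heads sit next to the mutex, but the prev/next pointers
// live in the Table objects and may still miss the cache.
inline Table* acquire_table(ShareTdc& s) {
    std::lock_guard<std::mutex> guard(s.lock);
    Table* t = s.free_tables;
    if (!t) return nullptr;
    list_remove(s.free_tables, t);
    push_front(s.used_tables, t);
    return t;
}

// Move a table back from the used list to the free list.
inline void release_table(ShareTdc& s, Table* t) {
    assert(t != nullptr);
    std::lock_guard<std::mutex> guard(s.lock);
    list_remove(s.used_tables, t);
    push_front(s.free_tables, t);
}
```

Calling heads_share_line_with_mutex() on an instance shows whether the layout goal is met on a given platform; if the mutex type is too large, the list heads spill into the next line and the benefit described above would not apply.
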
Did you benchmark that these cache misses are a problem? What is the main problem that impacts the performance?
We (Axel and I) did a lot of different benchmarks before we concluded that cache misses are the main problem. Please let me know if you're interested in specific results: we can either find them in the benchmark archives or benchmark again. One interesting result I just found is as follows...

10.0.4, read-only OLTP, 64 threads, tps ~10000
+---------------------------------------------+------------+-----------------+
| event_name                                  | count_star | sum_timer_wait  |
+---------------------------------------------+------------+-----------------+
| wait/synch/mutex/sql/LOCK_open              |    2784632 | 161835901661916 |
| wait/synch/mutex/mysys/THR_LOCK::mutex      |    2784556 |  28804019775192 |
...skip...

Note that LOCK_open and THR_LOCK::mutex are contested equally (nearly identical count_star), but the wait time differs ~6x. Removing used_tables from tc_acquire_table/tc_release_table makes sum_timer_wait go down from 161s to 100s.

Regards,
Sergey
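
The figures quoted from the table above can be re-derived with a short calculation. This is only a sketch: the MutexWait struct and the helper names are made up for illustration, and it assumes sum_timer_wait is reported in picoseconds, which is consistent with the "161s" quoted in the message. The ratio of the two totals comes out at roughly 5.6, which the message rounds to ~6x.

```cpp
#include <cstdint>

// Hypothetical record holding one row of the table above.
struct MutexWait {
    std::uint64_t count_star;         // number of waits recorded
    std::uint64_t sum_timer_wait_ps;  // total wait, assumed picoseconds
};

// Values copied from the 10.0.4 read-only OLTP run (64 threads).
constexpr MutexWait kLockOpen{2784632ULL, 161835901661916ULL};
constexpr MutexWait kThrLock{2784556ULL, 28804019775192ULL};

constexpr double kPsPerSecond = 1e12;
constexpr double kPsPerMicrosecond = 1e6;

// Total time spent waiting on the mutex, in seconds.
constexpr double total_seconds(const MutexWait& w) {
    return static_cast<double>(w.sum_timer_wait_ps) / kPsPerSecond;
}

// Average wait per recorded event, in microseconds.
constexpr double avg_wait_us(const MutexWait& w) {
    return static_cast<double>(w.sum_timer_wait_ps) / kPsPerMicrosecond /
           static_cast<double>(w.count_star);
}

// How many times longer the total wait on `a` is compared to `b`.
constexpr double wait_ratio(const MutexWait& a, const MutexWait& b) {
    return static_cast<double>(a.sum_timer_wait_ps) /
           static_cast<double>(b.sum_timer_wait_ps);
}
```

Because the two count_star values are nearly equal, the ratio of totals is also close to the ratio of average waits per acquisition, so the difference lies in how long each wait lasts, not in how often the mutexes are taken.
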