Hi Sergei, below are results of "operf -e LLC_MISSES:10000:0x41 --pid `pidof mysqld`". Tested with performance schema and general log off. tc_acquire_table() seem to be inlined, so it appears as tdc_acquire_share(). Also note that my_malloc_size_cb_func() looks like another bottleneck. Misses summary for tc_acquire_table()/tc_release_table() -------------------------------------------------------- 4.5408 + 3.0090= 7.5498% (10.0) 2.9916 + 2.6908= 5.6824% (10.0 + MDEV4956) 2.3502 + 1.7159= 4.0661% (10.0 + MDEV4956 + no unused_tables) 10.0 (rev.3912) --------------- CPU: Intel Sandy Bridge microarchitecture, speed 2.701e+06 MHz (estimated) Counted LLC_MISSES events (Last level cache demand requests from this core that missed the LLC) with a unit mask of 0x41 (No unit mask) count 10000 samples % image name symbol name 40292 34.8688 no-vmlinux /no-vmlinux 21638 18.7256 libpthread-2.15.so pthread_mutex_lock 6718 5.8138 libpthread-2.15.so pthread_mutex_unlock 5247 4.5408 mysqld tc_release_table(TABLE*) 3477 3.0090 mysqld tdc_acquire_share(THD*, char const*, char const*, char const*, unsigned int, unsigned int, TABLE**) 3410 2.9510 mysqld TABLE::init(THD*, TABLE_LIST*) 3316 2.8697 mysqld my_malloc_size_cb_func 3126 2.7053 mysqld open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*) 2360 2.0424 mysqld dispatch_command(enum_server_command, THD*, char*, unsigned int) 2254 1.9506 libpthread-2.15.so pthread_rwlock_unlock 2152 1.8623 libpthread-2.15.so __lll_lock_wait 1867 1.6157 mysqld heap_info 1525 1.3197 mysqld tdc_refresh_version() 1441 1.2470 mysqld read_lock_type_for_table(THD*, Query_tables_list*, TABLE_LIST*) 1374 1.1891 mysqld make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*) 1218 1.0541 mysqld get_lock_data(THD*, TABLE**, unsigned int, unsigned int) 1212 1.0489 mysqld THD::enter_stage(PSI_stage_info_v1 const*, PSI_stage_info_v1*, char const*, char const*, unsigned int) 1062 0.9191 mysqld _my_thread_var 945 0.8178 libc-2.15.so __memset_sse2 875 0.7572 mysqld thr_unlock 724 0.6266 mysqld lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int) 673 0.5824 mysqld thr_multi_lock 10.0 + MDEV4956 --------------- CPU: Intel Sandy Bridge microarchitecture, speed 2.701e+06 MHz (estimated) Counted LLC_MISSES events (Last level cache demand requests from this core that missed the LLC) with a unit mask of 0x41 (No unit mask) count 10000 samples % image name symbol name 43070 35.7909 no-vmlinux /no-vmlinux 22661 18.8311 libpthread-2.15.so pthread_mutex_lock 7960 6.6147 libpthread-2.15.so pthread_mutex_unlock 3600 2.9916 mysqld tdc_acquire_share(THD*, char const*, char const*, char const*, unsigned int, unsigned int, TABLE**) 3498 2.9068 mysqld my_malloc_size_cb_func 3238 2.6908 mysqld tc_release_table(TABLE*) 3189 2.6500 mysqld TABLE::init(THD*, TABLE_LIST*) 2611 2.1697 libpthread-2.15.so __lll_lock_wait 2414 2.0060 mysqld dispatch_command(enum_server_command, THD*, char*, unsigned int) 2360 1.9611 mysqld open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*) 2250 1.8697 mysqld heap_info 1889 1.5697 mysqld read_lock_type_for_table(THD*, Query_tables_list*, TABLE_LIST*) 1557 1.2939 mysqld make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*) 1378 1.1451 mysqld THD::enter_stage(PSI_stage_info_v1 const*, PSI_stage_info_v1*, char const*, char const*, unsigned int) 1319 1.0961 mysqld get_lock_data(THD*, TABLE**, unsigned int, unsigned int) 1293 1.0745 mysqld thr_multi_lock 1267 1.0529 libpthread-2.15.so pthread_rwlock_unlock 1211 1.0063 mysqld _my_thread_var 1136 0.9440 libc-2.15.so __memset_sse2 887 0.7371 mysqld thr_unlock 736 0.6116 mysqld free_root 698 0.5800 mysqld tdc_refresh_version() 10.0 + MDEV4956 + no unused_tables ---------------------------------- CPU: Intel Sandy Bridge microarchitecture, speed 2.701e+06 MHz (estimated) Counted LLC_MISSES events (Last level cache demand requests from this core that missed the LLC) with a unit mask of 0x41 (No unit mask) count 10000 samples % image name symbol name 41828 35.5653 no-vmlinux /no-vmlinux 22377 19.0266 libpthread-2.15.so pthread_mutex_lock 8219 6.9884 libpthread-2.15.so pthread_mutex_unlock 3578 3.0423 mysqld my_malloc_size_cb_func 3244 2.7583 mysqld TABLE::init(THD*, TABLE_LIST*) 2764 2.3502 mysqld tdc_acquire_share(THD*, char const*, char const*, char const*, unsigned int, unsigned int, TABLE**) 2630 2.2362 libpthread-2.15.so __lll_lock_wait 2570 2.1852 mysqld dispatch_command(enum_server_command, THD*, char*, unsigned int) 2407 2.0466 mysqld open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*) 2233 1.8987 mysqld heap_info 2027 1.7235 mysqld read_lock_type_for_table(THD*, Query_tables_list*, TABLE_LIST*) 2018 1.7159 mysqld tc_release_table(TABLE*) 1537 1.3069 mysqld make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*) 1404 1.1938 mysqld THD::enter_stage(PSI_stage_info_v1 const*, PSI_stage_info_v1*, char const*, char const*, unsigned int) 1368 1.1632 mysqld _my_thread_var 1343 1.1419 mysqld thr_multi_lock 1284 1.0918 mysqld get_lock_data(THD*, TABLE**, unsigned int, unsigned int) 1213 1.0314 libc-2.15.so __memset_sse2 1162 0.9880 libpthread-2.15.so pthread_rwlock_unlock 799 0.6794 mysqld thr_unlock 753 0.6403 mysqld tdc_refresh_version() 717 0.6096 mysqld THD::set_open_tables(TABLE*) Regards, Sergey On Mon, Sep 16, 2013 at 08:50:49PM +0400, Sergey Vojtovich wrote:
Hi Sergei,
just found another interesting test result. I added dummy LOCK_table_share mutex lock and unlock to tc_acquire_table() and tc_release_table() (before locking LOCK_open), just to measure pure mutex wait time.
Test execution time: 45s LOCK_open wait time: 34s LOCK_table_share wait time: 0.8s
+--------------------------------------------------------+------------+----------------+ | event_name | count_star | sum_timer_wait | +--------------------------------------------------------+------------+----------------+ | wait/synch/mutex/sql/LOCK_open | 585690 | 34298972259258 | | wait/synch/mutex/mysys/THR_LOCK::mutex | 585604 | 4560420039042 | | wait/synch/mutex/sql/TABLE_SHARE::tdc.LOCK_table_share | 585710 | 794564626359 | | wait/synch/rwlock/sql/LOCK_tdc | 290940 | 237751940139 | | wait/synch/mutex/sql/THD::LOCK_thd_data | 1838668 | 219829105251 | | wait/synch/rwlock/innodb/hash table locks | 683395 | 159792339294 | | wait/synch/rwlock/innodb/btr_search_latch | 290892 | 138915354207 | | wait/synch/mutex/innodb/trx_sys_mutex | 62940 | 78334973451 | | wait/synch/rwlock/innodb/index_tree_rw_lock | 167822 | 49323455349 | | wait/synch/rwlock/sql/MDL_lock::rwlock | 41970 | 31436713938 | +--------------------------------------------------------+------------+----------------+
Regards, Sergey
On Mon, Sep 16, 2013 at 04:46:41PM +0400, Sergey Vojtovich wrote:
Hi Sergei,
On Sat, Sep 14, 2013 at 04:44:28PM +0200, Sergei Golubchik wrote:
Hi, Sergey!
On Sep 13, Sergey Vojtovich wrote:
Hi Sergei,
comments inline and a question: 10.0 throughput is twice lower than 5.6 in a specific case. It is known to be caused by tc_acquire_table() and tc_release_table(). Do we want to fix it? If yes - how?
How is it caused by tc_acquire_table/tc_release_table? Threads spend a lot of time waiting for LOCK_open in these functions. Because protected by LOCK_open code takes a lot of time to execute.
In what specific case? The case is: many threads access one table (read-only OLTP).
> Why per-share lists are updated under the global mutex? Alas, it doesn't solve CPU cache coherence problem. It doesn't solve CPU cache coherence problem, yes. And it doesn't help if you have only one hot table. But it certainly helps if many threads access many tables. Ok, let's agree to agree: it will help in certain cases. Most probably it won't improve situation much if all threads access single table.
Of course.
We could try to ensure that per-share mutex is on the same cache line as free_tables and used_tables list heads. In this case I guess mysql_mutex_lock(&share->tdc.LOCK_table_share) will load list heads into CPU cache along with mutex structure. OTOH we still have to read per-TABLE prev/next pointers. And in 5.6 per-partition mutex should less frequently jump out of CPU cache than our per-share mutex. Worth trying?
Did you benchmark that these cache misses are a problem? What is the main problem that impacts the performance?
We (Axel and me) did a lot of different benchmarks before we concluded cache misses to be the main problem. Please let me known if you're interested in specific results - we either find them in benchmark archives or benchmark again.
One of interesting results I just found is as following... 10.0.4, read-only OLTP, 64 threads, tps ~10000 +---------------------------------------------+------------+-----------------+ | event_name | count_star | sum_timer_wait | +---------------------------------------------+------------+-----------------+ | wait/synch/mutex/sql/LOCK_open | 2784632 | 161835901661916 | | wait/synch/mutex/mysys/THR_LOCK::mutex | 2784556 | 28804019775192 | ...skip...
Note that LOCK_open and THR_LOCK::mutex are contested equally, but wait time differs ~6x.
Removing used_tables from tc_acquire_table/tc_release_table makes sum_timer_wait go down from 161s to 100s.
Regards, Sergey
_______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : maria-developers@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp
_______________________________________________ Mailing list: https://launchpad.net/~maria-developers Post to : maria-developers@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-developers More help : https://help.launchpad.net/ListHelp