Re: [Maria-developers] MDEV-4956 - Reduce usage of LOCK_open: TABLE_SHARE::tdc.used_tables
Hi, Sergey! On Aug 27, Sergey Vojtovich wrote:
At lp:maria/10.0
------------------------------------------------------------
revno: 3807
revision-id: svoj@mariadb.org-20130827121233-xh1uyhgfwbhedqyf
parent: jplindst@mariadb.org-20130823060357-pww92qxla7o8iir7
committer: Sergey Vojtovich <svoj@mariadb.org>
branch nick: 10.0
timestamp: Tue 2013-08-27 16:12:33 +0400
message: MDEV-4956 - Reduce usage of LOCK_open: TABLE_SHARE::tdc.used_tables
- tc_acquire_table and tc_release_table do not access TABLE_SHARE::tdc.used_tables anymore
- in tc_acquire_table(): release LOCK_tdc after we release LOCK_open (saves a few CPU cycles in the critical section)
- in tc_release_table(): if we have reached the table cache threshold, evict the to-be-released table without moving it to unused_tables. unused_tables must be empty at this point (added assertion).
I don't understand what you're doing here, could you elaborate? E.g. explain in a changeset comment what you've done, why you introduced a new list for tables (all_share), what is its semantics, etc. Regards, Sergei
Hi Sergei,

thanks for looking into this patch. Frankly speaking, I find it a bit questionable too. Below are links that should answer your questions...

What problem do I attempt to solve: https://lists.launchpad.net/maria-developers/msg06118.html
How do I attempt to solve it: https://mariadb.atlassian.net/browse/MDEV-4956

32 connections issue a simple SELECT against one table. The server has 4 CPUs (32 cores + 32 HyperThreads). For every statement we acquire a table from the table cache and then release it back to the cache. That involves updating 3 lists: unused_tables, per-share used_tables and free_tables. These lists are protected by LOCK_open (see tc_acquire_table() and tc_release_table()). Every time we update a global pointer, the corresponding cache lines of sibling CPUs have to be invalidated. This causes expensive memory reads while LOCK_open is held.

Oracle solved this problem by partitioning the table cache, allowing emulation of something like per-CPU lists. We attempted to split LOCK_open logically, and succeeded at everything but these 3 lists. I attempted a lock-free list for free_tables, but the TPS rate didn't improve.

What we need is to reduce the number of these expensive memory reads, and there are two solutions: partition these lists or get rid of them. As we agreed not to partition, I'm trying the latter solution.

Why do I find this patch questionable? It reduces LOCK_open wait time by 30%; to get close to Oracle's wait time, we need to reduce it by 90%. We could remove unused_tables as well, but that would be 60%, not 90%.

Thanks,
Sergey
Hi, Sergey! On Sep 10, Sergey Vojtovich wrote:
Hi Sergei,
thanks for looking into this patch. Frankly speaking I find it a bit questionable too. Below are links that should answer your questions... What problem do I attempt to solve: https://lists.launchpad.net/maria-developers/msg06118.html How do I attempt to solve it: https://mariadb.atlassian.net/browse/MDEV-4956
Yes, I've seen and remember both, but they don't answer my question, which was about specific changes that you've done, not about the goal. But ok, see below.
For every statement we acquire a table from the table cache and then release it back to the cache. That involves updating 3 lists: unused_tables, per-share used_tables and free_tables. These lists are protected by LOCK_open (see tc_acquire_table() and tc_release_table()).
Why are per-share lists updated under the global mutex?
Every time we update a global pointer, the corresponding cache lines of sibling CPUs have to be invalidated. This causes expensive memory reads while LOCK_open is held.
Oracle solved this problem by partitioning the table cache, allowing emulation of something like per-CPU lists.
We attempted to split LOCK_open logically, and succeeded at everything but these 3 lists. I attempted a lock-free list for free_tables, but the TPS rate didn't improve.
How did you do the lock-free list, could you show, please?
What we need is to reduce the number of these expensive memory reads, and there are two solutions: partition these lists or get rid of them. As we agreed not to partition, I'm trying the latter solution.
Well, you can partition the list. With 32 list head pointers. And a thread adding a table only to "this thread's" list. Of course, it's not complete partitioning between CPUs, as any thread can remove a table from any list. But at least there won't be one global list head pointer.
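A minimal, self-contained sketch of this partitioning idea - an array of independently locked list heads, where a thread normally pushes into "its own" slot but any thread may pop from any slot. All names, the per-slot mutex and the slot-selection rule are made up for illustration; this is not MariaDB code, just the shape of the technique:

#include <mutex>

// Stand-in for the server's TABLE; only the intrusive link matters here.
struct Table { Table *next= nullptr; };

static const unsigned SLOTS= 32;        // 32 list heads instead of one global head

struct FreeListSlot
{
  std::mutex lock;                      // protects this slot's list only
  Table *head= nullptr;
  char pad[64];                         // crude padding: keep slots on separate cache lines
};

static FreeListSlot free_slots[SLOTS];

// A thread parks a released table in the slot derived from its own id,
// so most pushes dirty only "its" cache line.
void release_table(Table *t, unsigned thread_id)
{
  FreeListSlot &s= free_slots[thread_id % SLOTS];
  std::lock_guard<std::mutex> g(s.lock);
  t->next= s.head;
  s.head= t;
}

// Any thread may take a table from any slot; it just starts with its own.
Table *acquire_table(unsigned thread_id)
{
  for (unsigned i= 0; i < SLOTS; i++)
  {
    FreeListSlot &s= free_slots[(thread_id + i) % SLOTS];
    std::lock_guard<std::mutex> g(s.lock);
    if (Table *t= s.head)
    {
      s.head= t->next;
      return t;
    }
  }
  return nullptr;                       // nothing cached: caller opens a new table
}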
Why do I find this patch questionable? It reduces LOCK_open wait time by 30%; to get close to Oracle's wait time, we need to reduce it by 90%. We could remove unused_tables as well, but that would be 60%, not 90%.
Hmm, if you're only interested in optimizing this specific use case - one table, many threads - then yes, maybe. But if you have many tables, then modifying per-share lists under the share's own mutex is, basically, a must. Regards, Sergei
Hi Sergei, On Tue, Sep 10, 2013 at 09:11:16PM +0200, Sergei Golubchik wrote:
Hi, Sergey!
On Sep 10, Sergey Vojtovich wrote:
Hi Sergei,
thanks for looking into this patch. Frankly speaking I find it a bit questionable too. Below are links that should answer your questions... What problem do I attempt to solve: https://lists.launchpad.net/maria-developers/msg06118.html How do I attempt to solve it: https://mariadb.atlassian.net/browse/MDEV-4956
Yes, I've seen and remember both, but they don't answer my question, which was about specific changes that you've done, not about the goal. But ok, see below.
For every statement we acquire a table from the table cache and then release it back to the cache. That involves updating 3 lists: unused_tables, per-share used_tables and free_tables. These lists are protected by LOCK_open (see tc_acquire_table() and tc_release_table()).
Why are per-share lists updated under the global mutex? I would have done that already if it gave us a considerable performance gain. Alas, it doesn't solve the CPU cache coherence problem.
Another reason is the global unused_tables list, which cannot be protected by a per-share mutex. So we'll have to lock an additional mutex twice per query. The third reason (less important) is that I'm not sure if TABLE_SHARE::visit_subgraph can be protected by a per-share mutex. I suspect it needs to see a consistent table cache state. The fourth reason (even less important) is that it is quite complex: there are invariants between free_tables, unused_tables and in_use.
Every time we update a global pointer, the corresponding cache lines of sibling CPUs have to be invalidated. This causes expensive memory reads while LOCK_open is held.
Oracle solved this problem by partitioning the table cache, allowing emulation of something like per-CPU lists.
We attempted to split LOCK_open logically, and succeeded at everything but these 3 lists. I attempted a lock-free list for free_tables, but the TPS rate didn't improve.
How did you do the lock-free list, could you show, please?
Please find it attached. It is mixed with different changes, just search for my_atomic_casptr.
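The attached patch isn't reproduced here, but a lock-free LIFO free list of the kind described - push/pop of TABLE objects with compare-and-swap instead of a mutex - would look roughly like the sketch below. std::atomic stands in for the server's my_atomic_casptr; a production version would also have to handle the ABA problem, which this sketch deliberately ignores. Note that the head is still one global hot cache line, which is consistent with the observation that this change alone didn't improve TPS:

#include <atomic>

struct Table { Table *next= nullptr; }; // stand-in for TABLE with an intrusive link

static std::atomic<Table*> free_tables{nullptr};

// Push a released table onto the list head with CAS; no mutex involved.
void push_free(Table *t)
{
  Table *old= free_tables.load(std::memory_order_relaxed);
  do
  {
    t->next= old;                       // on CAS failure, 'old' is refreshed and we retry
  } while (!free_tables.compare_exchange_weak(old, t,
                                              std::memory_order_release,
                                              std::memory_order_relaxed));
}

// Pop a table, or return nullptr if the list is empty.
// WARNING: a plain CAS pop like this suffers from ABA if another thread pops
// and re-pushes the same node in between; real code needs tags or hazard pointers.
Table *pop_free()
{
  Table *old= free_tables.load(std::memory_order_acquire);
  while (old &&
         !free_tables.compare_exchange_weak(old, old->next,
                                            std::memory_order_acquire,
                                            std::memory_order_acquire))
  {}
  return old;
}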
What we need is to reduce the number of these expensive memory reads, and there are two solutions: partition these lists or get rid of them. As we agreed not to partition, I'm trying the latter solution.
Well, you can partition the list. With 32 list head pointers. And a thread adding a table only to "this thread's" list. Of course, it's not complete partitioning between CPUs, as any thread can remove a table from any list. But at least there won't be one global list head pointer.
Yes, that's what Oracle did and what we're trying to avoid.
Why do I find this patch questionable? It reduces LOCK_open wait time by 30%; to get close to Oracle's wait time, we need to reduce it by 90%. We could remove unused_tables as well, but that would be 60%, not 90%.
Hmm, if you're only interested in optimizing this specific use case - one table, many threads - then yes, maybe. But if you have many tables, then modifying per-share lists under the share's own mutex is, basically, a must.
If you mean the opposite situation, when N threads access N tables (each thread accessing its own table): there should be no problem, because every thread is accessing its own lists. Well, except for unused_tables of course, but it can't be protected by a per-share mutex anyway (tested a few months ago, but we could ask XL to double check the above case if needed). Thanks, Sergey
Hi, Sergey! On Sep 11, Sergey Vojtovich wrote:
For every statement we acquire a table from the table cache and then release it back to the cache. That involves updating 3 lists: unused_tables, per-share used_tables and free_tables. These lists are protected by LOCK_open (see tc_acquire_table() and tc_release_table()).
Why are per-share lists updated under the global mutex?
I would have done that already if it gave us a considerable performance gain. Alas, it doesn't solve the CPU cache coherence problem.
It doesn't solve the CPU cache coherence problem, yes. And it doesn't help if you have only one hot table. But it certainly helps if many threads access many tables.
How did you do the lock-free list, could you show, please? Please find it attached. It is mixed with different changes, just search for my_atomic_casptr.
Thanks.
What we need is to reduce the number of these expensive memory reads, and there are two solutions: partition these lists or get rid of them. As we agreed not to partition, I'm trying the latter solution.
Well, you can partition the list. With 32 list head pointers. And a thread adding a table only to "this thread's" list. Of course, it's not complete partitioning between CPUs, as any thread can remove a table from any list. But at least there won't be one global list head pointer. Yes, that's what Oracle did and what we're trying to avoid.
I thought they've partitioned the TDC itself. And sometimes they need to lock all the partitions. If you only partition the unused_tables list, the TDC is shared by all threads and you always lock only one unused_tables list, never all of them. Regards, Sergei
Hi Sergei, comments inline and a question: 10.0 throughput is half that of 5.6 in a specific case. It is known to be caused by tc_acquire_table() and tc_release_table(). Do we want to fix it? If yes - how? On Thu, Sep 12, 2013 at 10:13:30PM +0200, Sergei Golubchik wrote:
Hi, Sergey!
On Sep 11, Sergey Vojtovich wrote:
For every statement we acquire a table from the table cache and then release it back to the cache. That involves updating 3 lists: unused_tables, per-share used_tables and free_tables. These lists are protected by LOCK_open (see tc_acquire_table() and tc_release_table()).
Why are per-share lists updated under the global mutex?
I would have done that already if it gave us a considerable performance gain. Alas, it doesn't solve the CPU cache coherence problem.
It doesn't solve the CPU cache coherence problem, yes. And it doesn't help if you have only one hot table.
But it certainly helps if many threads access many tables. Ok, let's agree to agree: it will help in certain cases. Most probably it won't improve the situation much if all threads access a single table.
We could try to ensure that per-share mutex is on the same cache line as free_tables and used_tables list heads. In this case I guess mysql_mutex_lock(&share->tdc.LOCK_table_share) will load list heads into CPU cache along with mutex structure. OTOH we still have to read per-TABLE prev/next pointers. And in 5.6 per-partition mutex should less frequently jump out of CPU cache than our per-share mutex. Worth trying?
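For illustration, the layout idea above amounts to packing the mutex and the two list heads into one 64-byte cache line, so that taking the lock also pulls the heads into cache. This is a standalone sketch with hypothetical names (std::mutex instead of mysql_mutex_t; on common Linux/x86-64 builds the three members fit in 64 bytes, but an instrumented mysql_mutex_t may well be larger, so the benefit is not guaranteed):

#include <mutex>

struct Table { Table *next= nullptr; };

// Hot part of a hypothetical per-share table cache block: mutex and list
// heads packed together and aligned to a 64-byte cache line boundary.
struct alignas(64) ShareCacheHot
{
  std::mutex lock;                  // taken on every acquire/release
  Table *free_tables= nullptr;      // read immediately after the lock is taken
  Table *used_tables= nullptr;
};

// Taking the lock brings free_tables/used_tables in on the same cache line.
Table *acquire(ShareCacheHot &share)
{
  std::lock_guard<std::mutex> g(share.lock);
  Table *t= share.free_tables;
  if (t)
    share.free_tables= t->next;     // unlink from the free list
  return t;                         // caller would link it into used_tables
}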
How did you do the lock-free list, could you show, please? Please find it attached. It is mixed with different changes, just search for my_atomic_casptr.
Thanks.
What we need is to reduce the number of these expensive memory reads, and there are two solutions: partition these lists or get rid of them. As we agreed not to partition, I'm trying the latter solution.
Well, you can partition the list. With 32 list head pointers. And a thread adding a table only to "this thread's" list. Of course, it's not complete partitioning between CPUs, as any thread can remove a table from any list. But at least there won't be one global list head pointer. Yes, that's what Oracle did and what we're trying to avoid.
I thought they've partitioned the TDC itself. And sometimes they need to lock all the partitions. If you only partition the unused_tables list, the TDC is shared by all threads and you always lock only one unused_tables list, never all of them.
Since they didn't split locks logically, yes, they had to do a more complex solution: they have a global hash of TABLE_SHARE objects (protected by LOCK_open) + a per-partition hash of Table_cache_element objects (protected by a per-partition lock).

class Table_cache_element
{
  TABLE_list used_tables;
  TABLE_list free_tables;
  TABLE_SHARE *share;
};

class Table_cache // table cache partition
{
  mysql_mutex_t m_lock;
  HASH m_cache;            // collection of Table_cache_element objects
  TABLE *m_unused_tables;
  uint m_table_count;
};

Except for "m_cache", the per-partition mutex protects exactly what is protected by our LOCK_open currently.

Thanks,
Sergey
Hi, Sergey! On Sep 13, Sergey Vojtovich wrote:
Hi Sergei,
comments inline and a question: 10.0 throughput is half that of 5.6 in a specific case. It is known to be caused by tc_acquire_table() and tc_release_table(). Do we want to fix it? If yes - how?
How is it caused by tc_acquire_table/tc_release_table? In what specific case?
Why are per-share lists updated under the global mutex? Alas, it doesn't solve the CPU cache coherence problem. It doesn't solve the CPU cache coherence problem, yes. And it doesn't help if you have only one hot table. But it certainly helps if many threads access many tables. Ok, let's agree to agree: it will help in certain cases. Most probably it won't improve the situation much if all threads access a single table.
Of course.
We could try to ensure that per-share mutex is on the same cache line as free_tables and used_tables list heads. In this case I guess mysql_mutex_lock(&share->tdc.LOCK_table_share) will load list heads into CPU cache along with mutex structure. OTOH we still have to read per-TABLE prev/next pointers. And in 5.6 per-partition mutex should less frequently jump out of CPU cache than our per-share mutex. Worth trying?
Did you benchmark that these cache misses are a problem? What is the main problem that impacts the performance? Regards, Sergei
Hi Sergei, On Sat, Sep 14, 2013 at 04:44:28PM +0200, Sergei Golubchik wrote:
Hi, Sergey!
On Sep 13, Sergey Vojtovich wrote:
Hi Sergei,
comments inline and a question: 10.0 throughput is half that of 5.6 in a specific case. It is known to be caused by tc_acquire_table() and tc_release_table(). Do we want to fix it? If yes - how?
How is it caused by tc_acquire_table/tc_release_table? Threads spend a lot of time waiting for LOCK_open in these functions, because the code protected by LOCK_open takes a long time to execute.
In what specific case? The case is: many threads access one table (read-only OLTP).
Why are per-share lists updated under the global mutex? Alas, it doesn't solve the CPU cache coherence problem. It doesn't solve the CPU cache coherence problem, yes. And it doesn't help if you have only one hot table. But it certainly helps if many threads access many tables. Ok, let's agree to agree: it will help in certain cases. Most probably it won't improve the situation much if all threads access a single table.
Of course.
We could try to ensure that per-share mutex is on the same cache line as free_tables and used_tables list heads. In this case I guess mysql_mutex_lock(&share->tdc.LOCK_table_share) will load list heads into CPU cache along with mutex structure. OTOH we still have to read per-TABLE prev/next pointers. And in 5.6 per-partition mutex should less frequently jump out of CPU cache than our per-share mutex. Worth trying?
Did you benchmark that these cache misses are a problem? What is the main problem that impacts the performance?
We (Axel and I) did a lot of different benchmarks before we concluded cache misses to be the main problem. Please let me know if you're interested in specific results - we can either find them in the benchmark archives or benchmark again.

One of the interesting results I just found is the following... 10.0.4, read-only OLTP, 64 threads, tps ~10000

+---------------------------------------------+------------+-----------------+
| event_name                                  | count_star | sum_timer_wait  |
+---------------------------------------------+------------+-----------------+
| wait/synch/mutex/sql/LOCK_open              | 2784632    | 161835901661916 |
| wait/synch/mutex/mysys/THR_LOCK::mutex      | 2784556    | 28804019775192  |
...skip...

Note that LOCK_open and THR_LOCK::mutex are contended equally, but wait time differs ~6x.

Removing used_tables from tc_acquire_table/tc_release_table makes sum_timer_wait go down from 161s to 100s.

Regards,
Sergey
Hi Sergei,

just found another interesting test result. I added a dummy LOCK_table_share mutex lock and unlock to tc_acquire_table() and tc_release_table() (before locking LOCK_open), just to measure pure mutex wait time.

Test execution time: 45s
LOCK_open wait time: 34s
LOCK_table_share wait time: 0.8s

+--------------------------------------------------------+------------+----------------+
| event_name                                             | count_star | sum_timer_wait |
+--------------------------------------------------------+------------+----------------+
| wait/synch/mutex/sql/LOCK_open                         | 585690     | 34298972259258 |
| wait/synch/mutex/mysys/THR_LOCK::mutex                 | 585604     | 4560420039042  |
| wait/synch/mutex/sql/TABLE_SHARE::tdc.LOCK_table_share | 585710     | 794564626359   |
| wait/synch/rwlock/sql/LOCK_tdc                         | 290940     | 237751940139   |
| wait/synch/mutex/sql/THD::LOCK_thd_data                | 1838668    | 219829105251   |
| wait/synch/rwlock/innodb/hash table locks              | 683395     | 159792339294   |
| wait/synch/rwlock/innodb/btr_search_latch              | 290892     | 138915354207   |
| wait/synch/mutex/innodb/trx_sys_mutex                  | 62940      | 78334973451    |
| wait/synch/rwlock/innodb/index_tree_rw_lock            | 167822     | 49323455349    |
| wait/synch/rwlock/sql/MDL_lock::rwlock                 | 41970      | 31436713938    |
+--------------------------------------------------------+------------+----------------+

Regards,
Sergey
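The "dummy" instrumentation described above boils down to something like the following (standalone sketch, made-up names): an immediately released lock taken right before the real one, purely so the wait statistics report the per-share mutex side by side with LOCK_open:

#include <mutex>

std::mutex LOCK_open;          // stands in for the global table cache mutex
std::mutex LOCK_table_share;   // stands in for the per-share mutex being measured

void tc_release_table_instrumented()
{
  // Dummy lock/unlock: functionally a no-op, but wait time on this mutex now
  // shows up in the wait statistics, isolating its cost from LOCK_open's.
  LOCK_table_share.lock();
  LOCK_table_share.unlock();

  LOCK_open.lock();
  // ... the real tc_release_table() work would happen here ...
  LOCK_open.unlock();
}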
Hi Sergei,

below are results of "operf -e LLC_MISSES:10000:0x41 --pid `pidof mysqld`". Tested with performance schema and general log off. tc_acquire_table() seems to be inlined, so it appears as tdc_acquire_share(). Also note that my_malloc_size_cb_func() looks like another bottleneck.

Misses summary for tc_acquire_table()/tc_release_table()
--------------------------------------------------------
4.5408 + 3.0090 = 7.5498% (10.0)
2.9916 + 2.6908 = 5.6824% (10.0 + MDEV4956)
2.3502 + 1.7159 = 4.0661% (10.0 + MDEV4956 + no unused_tables)

10.0 (rev.3912)
---------------
CPU: Intel Sandy Bridge microarchitecture, speed 2.701e+06 MHz (estimated)
Counted LLC_MISSES events (Last level cache demand requests from this core that missed the LLC) with a unit mask of 0x41 (No unit mask) count 10000
samples  %        image name          symbol name
40292    34.8688  no-vmlinux          /no-vmlinux
21638    18.7256  libpthread-2.15.so  pthread_mutex_lock
6718      5.8138  libpthread-2.15.so  pthread_mutex_unlock
5247      4.5408  mysqld              tc_release_table(TABLE*)
3477      3.0090  mysqld              tdc_acquire_share(THD*, char const*, char const*, char const*, unsigned int, unsigned int, TABLE**)
3410      2.9510  mysqld              TABLE::init(THD*, TABLE_LIST*)
3316      2.8697  mysqld              my_malloc_size_cb_func
3126      2.7053  mysqld              open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
2360      2.0424  mysqld              dispatch_command(enum_server_command, THD*, char*, unsigned int)
2254      1.9506  libpthread-2.15.so  pthread_rwlock_unlock
2152      1.8623  libpthread-2.15.so  __lll_lock_wait
1867      1.6157  mysqld              heap_info
1525      1.3197  mysqld              tdc_refresh_version()
1441      1.2470  mysqld              read_lock_type_for_table(THD*, Query_tables_list*, TABLE_LIST*)
1374      1.1891  mysqld              make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
1218      1.0541  mysqld              get_lock_data(THD*, TABLE**, unsigned int, unsigned int)
1212      1.0489  mysqld              THD::enter_stage(PSI_stage_info_v1 const*, PSI_stage_info_v1*, char const*, char const*, unsigned int)
1062      0.9191  mysqld              _my_thread_var
945       0.8178  libc-2.15.so        __memset_sse2
875       0.7572  mysqld              thr_unlock
724       0.6266  mysqld              lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int)
673       0.5824  mysqld              thr_multi_lock

10.0 + MDEV4956
---------------
CPU: Intel Sandy Bridge microarchitecture, speed 2.701e+06 MHz (estimated)
Counted LLC_MISSES events (Last level cache demand requests from this core that missed the LLC) with a unit mask of 0x41 (No unit mask) count 10000
samples  %        image name          symbol name
43070    35.7909  no-vmlinux          /no-vmlinux
22661    18.8311  libpthread-2.15.so  pthread_mutex_lock
7960      6.6147  libpthread-2.15.so  pthread_mutex_unlock
3600      2.9916  mysqld              tdc_acquire_share(THD*, char const*, char const*, char const*, unsigned int, unsigned int, TABLE**)
3498      2.9068  mysqld              my_malloc_size_cb_func
3238      2.6908  mysqld              tc_release_table(TABLE*)
3189      2.6500  mysqld              TABLE::init(THD*, TABLE_LIST*)
2611      2.1697  libpthread-2.15.so  __lll_lock_wait
2414      2.0060  mysqld              dispatch_command(enum_server_command, THD*, char*, unsigned int)
2360      1.9611  mysqld              open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
2250      1.8697  mysqld              heap_info
1889      1.5697  mysqld              read_lock_type_for_table(THD*, Query_tables_list*, TABLE_LIST*)
1557      1.2939  mysqld              make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
1378      1.1451  mysqld              THD::enter_stage(PSI_stage_info_v1 const*, PSI_stage_info_v1*, char const*, char const*, unsigned int)
1319      1.0961  mysqld              get_lock_data(THD*, TABLE**, unsigned int, unsigned int)
1293      1.0745  mysqld              thr_multi_lock
1267      1.0529  libpthread-2.15.so  pthread_rwlock_unlock
1211      1.0063  mysqld              _my_thread_var
1136      0.9440  libc-2.15.so        __memset_sse2
887       0.7371  mysqld              thr_unlock
736       0.6116  mysqld              free_root
698       0.5800  mysqld              tdc_refresh_version()

10.0 + MDEV4956 + no unused_tables
----------------------------------
CPU: Intel Sandy Bridge microarchitecture, speed 2.701e+06 MHz (estimated)
Counted LLC_MISSES events (Last level cache demand requests from this core that missed the LLC) with a unit mask of 0x41 (No unit mask) count 10000
samples  %        image name          symbol name
41828    35.5653  no-vmlinux          /no-vmlinux
22377    19.0266  libpthread-2.15.so  pthread_mutex_lock
8219      6.9884  libpthread-2.15.so  pthread_mutex_unlock
3578      3.0423  mysqld              my_malloc_size_cb_func
3244      2.7583  mysqld              TABLE::init(THD*, TABLE_LIST*)
2764      2.3502  mysqld              tdc_acquire_share(THD*, char const*, char const*, char const*, unsigned int, unsigned int, TABLE**)
2630      2.2362  libpthread-2.15.so  __lll_lock_wait
2570      2.1852  mysqld              dispatch_command(enum_server_command, THD*, char*, unsigned int)
2407      2.0466  mysqld              open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
2233      1.8987  mysqld              heap_info
2027      1.7235  mysqld              read_lock_type_for_table(THD*, Query_tables_list*, TABLE_LIST*)
2018      1.7159  mysqld              tc_release_table(TABLE*)
1537      1.3069  mysqld              make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
1404      1.1938  mysqld              THD::enter_stage(PSI_stage_info_v1 const*, PSI_stage_info_v1*, char const*, char const*, unsigned int)
1368      1.1632  mysqld              _my_thread_var
1343      1.1419  mysqld              thr_multi_lock
1284      1.0918  mysqld              get_lock_data(THD*, TABLE**, unsigned int, unsigned int)
1213      1.0314  libc-2.15.so        __memset_sse2
1162      0.9880  libpthread-2.15.so  pthread_rwlock_unlock
799       0.6794  mysqld              thr_unlock
753       0.6403  mysqld              tdc_refresh_version()
717       0.6096  mysqld              THD::set_open_tables(TABLE*)

Regards,
Sergey
Hi Sergei,

I just realized that I didn't share benchmark results (read-only OLTP that XL did):

5.6 tps: ~18k
10.0 tps: ~9k
10.0 + MDEV-4956 tps: ~11k

I estimate tc_acquire_table and tc_release_table are eating up ~6k tps (2k per list).

Regards,
Sergey
On 11-09-2013 at 09:15:47, Sergey Vojtovich <svoj@mariadb.org> wrote:
Hi Sergei,
I just realized that I didn't share benchmark results (read-only OLTP that XL did):
5.6 tps: ~18k
10.0 tps: ~9k
10.0 + MDEV-4956 tps: ~11k
I estimate tc_acquire_table and tc_release_table are eating up ~6k tps (2k per list).
Wasn't this difference caused by LOCK_plugin? -- Patryk Pomykalski
Hi Patryk,

the LOCK_plugin issue was caused by the semisync plugins, which were disabled in subsequent benchmarks. When they were enabled, tps was even lower: ~6.5k.

Regards,
Sergey