[Maria-developers] MDEV-7004 review
Hi, Sergey

I didn't review MDEV-7200 (InnoDB) and patches that are already in 10.1.
Almost everything else was fine, see a few comments below.
------------------------------------------------------------
revno: 4513
committer: Sergey Vojtovich <svoj@mariadb.org>
branch nick: 10.0-power
timestamp: Fri 2014-11-07 22:03:18 +0400
message:
  Multi-rwlock lock for table definition cache hash.
added:
modified:
diff:
=== modified file 'sql/table_cache.cc'
--- sql/table_cache.cc  2014-10-21 12:20:39 +0000
+++ sql/table_cache.cc  2014-11-07 18:03:18 +0000
@@ -53,6 +53,77 @@
 #include "table.h"
 #include "sql_base.h"
+
+/**
+  Table definition hash lock.
+
+  Lock consists of multiple rw-locks (8 by default). Lock is write locked
+  when all instances are write locked. Lock is read locked when any instance
+  is read locked.
+
+  This class was implemented to workaround scalability bottleneck of table
+  definition hash rw-lock.
+
+  To be replaced with lock-free hash when we understand how to solve the
+  following problems:
+  - lock-free hash iterator (can be workarounded by introducing global
+    all_tables list);
I don't particularly like that solution. Could you try to switch to lock-free hash now? I'll try to do the iterator asap
+  - lf_hash_search() may fail because of OOM. It is rather complex to handle
+    it properly.
+*/
+
+class TDC_lock
+{
+  uint m_instances;
+  bool m_write_locked;
+  char pad[128];
+  struct st_locks
+  {
+    mysql_rwlock_t lock;
+    char pad[128];
In these cases I usually do: pad[128 - sizeof(mysql_rwlock_t)];
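E.g., just a sketch of that idiom (assuming a 128-byte cache line; the size
check below is only a poor man's static assert, not code from the patch):

  struct st_locks
  {
    mysql_rwlock_t lock;
    /* pad the struct up to one full cache line so two locks never share one */
    char pad[128 - sizeof(mysql_rwlock_t)];
  };
  /* fails to compile if the math above is wrong */
  typedef char st_locks_size_check[sizeof(struct st_locks) == 128 ? 1 : -1];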
+  } *locks;
+public:
+  int init(PSI_rwlock_key key, uint instances)
+  {
+    m_instances= instances;
+    m_write_locked= false;
+    locks= (st_locks *) my_malloc(sizeof(struct st_locks) * m_instances, MYF(MY_WME));
+    if (!locks)
+      return 1;
+    for (uint i= 0; i < m_instances; i++)
+      mysql_rwlock_init(key, &locks[i].lock);
+    return 0;
+  }
+  void destroy(void)
+  {
+    for (uint i= 0; i < m_instances; i++)
+      mysql_rwlock_destroy(&locks[i].lock);
+    my_free(locks);
+  }
+  void wrlock(void)
+  {
+    for (uint i= 0; i < m_instances; i++)
+      mysql_rwlock_wrlock(&locks[i].lock);
+    m_write_locked= true;
+  }
+  void rdlock(THD *thd= 0)
+  {
+    mysql_rwlock_rdlock(&locks[thd ? thd->thread_id % m_instances : 0].lock);
+  }
+  void unlock(THD *thd= 0)
+  {
+    if (m_write_locked)
+    {
+      m_write_locked= false;
+      for (uint i= 0; i < m_instances; i++)
+        mysql_rwlock_unlock(&locks[i].lock);
+    }
+    else
+      mysql_rwlock_unlock(&locks[thd ? thd->thread_id % m_instances : 0].lock);
+  }
+};
+
+
 /** Configuration. */
 ulong tdc_size; /**< Table definition cache threshold for LRU eviction. */
 ulong tc_size; /**< Table cache threshold for LRU eviction. */

------------------------------------------------------------
revno: 4512
committer: Sergey Vojtovich <svoj@mariadb.org>
branch nick: 10.0-power
timestamp: Fri 2014-10-31 15:44:36 +0400
message:
  Preallocate locks on THD mem_root to avoid expensive malloc.
modified:
diff:
=== modified file 'sql/lock.cc'
--- sql/lock.cc  2014-09-30 17:31:14 +0000
+++ sql/lock.cc  2014-10-31 11:44:36 +0000
@@ -700,7 +699,7 @@ MYSQL_LOCK *get_lock_data(THD *thd, TABL
   TABLE **to, **table_buf;
   DBUG_ENTER("get_lock_data");
-  DBUG_ASSERT((flags == GET_LOCK_UNLOCK) || (flags == GET_LOCK_STORE_LOCKS));
+  DBUG_ASSERT((flags & GET_LOCK_UNLOCK) || (flags & GET_LOCK_STORE_LOCKS));
Not the same: the old assert also verified that these flags aren't both set;
the new one doesn't. But I don't think it matters much.
DBUG_PRINT("info", ("count %d", count));
   for (i=lock_count=table_count=0 ; i < count ; i++)

------------------------------------------------------------
revno: 4508
committer: Sergey Vojtovich <svoj@mariadb.org>
branch nick: 10.0
timestamp: Tue 2014-10-21 16:20:39 +0400
message:
modified:
diff:
=== modified file 'sql/handler.cc'
--- sql/handler.cc  2014-11-13 10:01:31 +0000
+++ sql/handler.cc  2014-10-21 12:20:39 +0000
@@ -410,13 +410,16 @@ static int hton_ext_based_table_discover
 static void update_discovery_counters(handlerton *hton, int val)
 {
   if (hton->discover_table_existence == full_discover_for_existence)
-    my_atomic_add32(&need_full_discover_for_existence, val);
+    my_atomic_add32_explicit(&need_full_discover_for_existence, val,
+                             MY_MEMORY_ORDER_RELAXED);
   if (hton->discover_table_names)
-    my_atomic_add32(&engines_with_discover_table_names, val);
+    my_atomic_add32_explicit(&engines_with_discover_table_names, val,
+                             MY_MEMORY_ORDER_RELAXED);
   if (hton->discover_table)
-    my_atomic_add32(&engines_with_discover, val);
+    my_atomic_add32_explicit(&engines_with_discover, val,
+                             MY_MEMORY_ORDER_RELAXED);
I wouldn't do that. Technically there's a dependency between these counters and loaded engines, so the order, kind of, matters. I don't think there is any practical risk of bugs here, but, on the other hand, these counters are only updated when an engine is loaded or unloaded, so there's no need to bother with the relaxed memory ordering here.
}
int ha_finalize_handlerton(st_plugin_int *plugin)
=== modified file 'sql/mysqld.cc'
--- sql/mysqld.cc  2014-11-18 21:25:47 +0000
+++ sql/mysqld.cc  2014-10-21 12:20:39 +0000
@@ -3884,7 +3884,7 @@ static void my_malloc_size_cb_func(long
   }
   // workaround for gcc 4.2.4-1ubuntu4 -fPIE (from DEB_BUILD_HARDENING=1)
   int64 volatile * volatile ptr=&global_status_var.memory_used;
-  my_atomic_add64(ptr, size);
+  my_atomic_add64_explicit(ptr, size, MY_MEMORY_ORDER_RELAXED);
good!
   }
 }
=== modified file 'sql/mysqld.h'
--- sql/mysqld.h  2014-11-13 08:56:28 +0000
+++ sql/mysqld.h  2014-10-21 12:20:39 +0000
@@ -629,7 +629,7 @@ inline __attribute__((warn_unused_result
 {
   query_id_t id;
   my_atomic_rwlock_wrlock(&global_query_id_lock);
-  id= my_atomic_add64(&global_query_id, 1);
+  id= my_atomic_add64_explicit(&global_query_id, 1, MY_MEMORY_ORDER_RELAXED);
I believe this is fine
   my_atomic_rwlock_wrunlock(&global_query_id_lock);
   return (id);
 }

=== modified file 'sql/table_cache.cc'
--- sql/table_cache.cc  2014-06-10 18:20:33 +0000
+++ sql/table_cache.cc  2014-10-21 12:20:39 +0000
@@ -139,7 +139,7 @@ uint tc_records(void)
 {
   uint count;
   my_atomic_rwlock_rdlock(&LOCK_tdc_atomics);
-  count= my_atomic_load32(&tc_count);
+  count= my_atomic_load32_explicit(&tc_count, MY_MEMORY_ORDER_RELAXED);
I'm not sure about this file. You should know better whether it's safe to use relaxed memory ordering here.
   my_atomic_rwlock_rdunlock(&LOCK_tdc_atomics);
   return count;
 }
Regards, Sergei
Hi Sergei,

Thanks for looking into this!

On Wed, Nov 26, 2014 at 08:00:44PM +0100, Sergei Golubchik wrote:
Hi, Sergey
I didn't review MDEV-7200 (InnoDB) and patches that are already in 10.1.

Patch for MDEV-7200 is not good to go into 10.1 as is anyway. It is here
just as a reminder that we can gain another 2% with a simple fix.
Almost everything else was fine, see a few comments below.

Some of my patches depend on Monty's patches that haven't actually been
pushed to 10.1 yet. I'll hold those. Further comments inline.
------------------------------------------------------------
revno: 4513
committer: Sergey Vojtovich <svoj@mariadb.org>
branch nick: 10.0-power
timestamp: Fri 2014-11-07 22:03:18 +0400
message:
  Multi-rwlock lock for table definition cache hash.
added:
modified:
diff:
=== modified file 'sql/table_cache.cc'
--- sql/table_cache.cc  2014-10-21 12:20:39 +0000
+++ sql/table_cache.cc  2014-11-07 18:03:18 +0000
@@ -53,6 +53,77 @@
 #include "table.h"
 #include "sql_base.h"
+
+/**
+  Table definition hash lock.
+
+  Lock consists of multiple rw-locks (8 by default). Lock is write locked
+  when all instances are write locked. Lock is read locked when any instance
+  is read locked.
+
+  This class was implemented to workaround scalability bottleneck of table
+  definition hash rw-lock.
+
+  To be replaced with lock-free hash when we understand how to solve the
+  following problems:
+  - lock-free hash iterator (can be workarounded by introducing global
+    all_tables list);
I don't particularly like that solution. Could you try to switch to lock-free hash now?
I'll try to do the iterator asap

This is the change that allowed us to break through 27kTPS to 33kTPS. I don't
like it either. My raw lf-hash prototype shows similar results. I think we
badly need to have this bottleneck solved in 10.1 one way or another. So my
question is: do you feel like we will be able to implement this all on time
for 10.1?
+  - lf_hash_search() may fail because of OOM. It is rather complex to handle
+    it properly.
+*/
+
+class TDC_lock
+{
+  uint m_instances;
+  bool m_write_locked;
+  char pad[128];
Interesting, why did I add the above padding... probably a leftover from an
intermediate patch. Will remove it.
+  struct st_locks
+  {
+    mysql_rwlock_t lock;
+    char pad[128];
In these cases I usually do: pad[128 - sizeof(mysql_rwlock_t)];

Hmm... strange that I didn't define a macro for this, I thought I did.
Your suggestion guarantees that st_locks will be at least 128 bytes. But
these 128 bytes may be split across 2 cache lines. Or does malloc return a
pointer aligned to the cache line size?
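If we really wanted such a guarantee, I suppose init() would have to
over-allocate and round the base up by hand, roughly like this (hypothetical
sketch, not what the patch does; the raw pointer would have to be remembered
somewhere for my_free()):

  char *raw= (char *) my_malloc(sizeof(struct st_locks) * m_instances + 127,
                                MYF(MY_WME));
  if (!raw)
    return 1;
  /* round the base up to the next 128-byte boundary */
  locks= (st_locks *) (((uintptr_t) raw + 127) & ~(uintptr_t) 127);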
+  } *locks;
+public:
+  int init(PSI_rwlock_key key, uint instances)
+  {
+    m_instances= instances;
+    m_write_locked= false;
+    locks= (st_locks *) my_malloc(sizeof(struct st_locks) * m_instances, MYF(MY_WME));
+    if (!locks)
+      return 1;
+    for (uint i= 0; i < m_instances; i++)
+      mysql_rwlock_init(key, &locks[i].lock);
+    return 0;
+  }
+  void destroy(void)
+  {
+    for (uint i= 0; i < m_instances; i++)
+      mysql_rwlock_destroy(&locks[i].lock);
+    my_free(locks);
+  }
+  void wrlock(void)
+  {
+    for (uint i= 0; i < m_instances; i++)
+      mysql_rwlock_wrlock(&locks[i].lock);
+    m_write_locked= true;
+  }
+  void rdlock(THD *thd= 0)
+  {
+    mysql_rwlock_rdlock(&locks[thd ? thd->thread_id % m_instances : 0].lock);
+  }
+  void unlock(THD *thd= 0)
+  {
+    if (m_write_locked)
+    {
+      m_write_locked= false;
+      for (uint i= 0; i < m_instances; i++)
+        mysql_rwlock_unlock(&locks[i].lock);
+    }
+    else
+      mysql_rwlock_unlock(&locks[thd ? thd->thread_id % m_instances : 0].lock);
+  }
+};
+
+
 /** Configuration. */
 ulong tdc_size; /**< Table definition cache threshold for LRU eviction. */
 ulong tc_size; /**< Table cache threshold for LRU eviction. */

------------------------------------------------------------
revno: 4512
committer: Sergey Vojtovich <svoj@mariadb.org>
branch nick: 10.0-power
timestamp: Fri 2014-10-31 15:44:36 +0400
message:
  Preallocate locks on THD mem_root to avoid expensive malloc.
modified:
diff:
=== modified file 'sql/lock.cc'
--- sql/lock.cc  2014-09-30 17:31:14 +0000
+++ sql/lock.cc  2014-10-31 11:44:36 +0000
@@ -700,7 +699,7 @@ MYSQL_LOCK *get_lock_data(THD *thd, TABL
   TABLE **to, **table_buf;
   DBUG_ENTER("get_lock_data");
-  DBUG_ASSERT((flags == GET_LOCK_UNLOCK) || (flags == GET_LOCK_STORE_LOCKS));
+  DBUG_ASSERT((flags & GET_LOCK_UNLOCK) || (flags & GET_LOCK_STORE_LOCKS));
Not the same: the old assert also verified that these flags aren't both set;
the new one doesn't. But I don't think it matters much.
You're right, I'll change this: the stricter, the better.
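Probably something along these lines (untested sketch; extra flags are still
tolerated, but exactly one of these two must be set):

  DBUG_ASSERT(((flags & GET_LOCK_UNLOCK) != 0) ^
              ((flags & GET_LOCK_STORE_LOCKS) != 0));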
DBUG_PRINT("info", ("count %d", count));
   for (i=lock_count=table_count=0 ; i < count ; i++)

------------------------------------------------------------
revno: 4508
committer: Sergey Vojtovich <svoj@mariadb.org>
branch nick: 10.0
timestamp: Tue 2014-10-21 16:20:39 +0400
message:
modified:
diff:
=== modified file 'sql/handler.cc'
--- sql/handler.cc  2014-11-13 10:01:31 +0000
+++ sql/handler.cc  2014-10-21 12:20:39 +0000
@@ -410,13 +410,16 @@ static int hton_ext_based_table_discover
 static void update_discovery_counters(handlerton *hton, int val)
 {
   if (hton->discover_table_existence == full_discover_for_existence)
-    my_atomic_add32(&need_full_discover_for_existence, val);
+    my_atomic_add32_explicit(&need_full_discover_for_existence, val,
+                             MY_MEMORY_ORDER_RELAXED);
   if (hton->discover_table_names)
-    my_atomic_add32(&engines_with_discover_table_names, val);
+    my_atomic_add32_explicit(&engines_with_discover_table_names, val,
+                             MY_MEMORY_ORDER_RELAXED);
   if (hton->discover_table)
-    my_atomic_add32(&engines_with_discover, val);
+    my_atomic_add32_explicit(&engines_with_discover, val,
+                             MY_MEMORY_ORDER_RELAXED);
I wouldn't do that. Technically there's a dependency between these counters and loaded engines, so the order, kind of, matters. I don't think there is any practical risk of bugs here, but, on the other hand, these counters are only updated when an engine is loaded or unloaded, so there's no need to bother with the relaxed memory ordering here.
Ok, I'll revert this.
}
int ha_finalize_handlerton(st_plugin_int *plugin)
=== modified file 'sql/mysqld.cc'
--- sql/mysqld.cc  2014-11-18 21:25:47 +0000
+++ sql/mysqld.cc  2014-10-21 12:20:39 +0000
@@ -3884,7 +3884,7 @@ static void my_malloc_size_cb_func(long
   }
   // workaround for gcc 4.2.4-1ubuntu4 -fPIE (from DEB_BUILD_HARDENING=1)
   int64 volatile * volatile ptr=&global_status_var.memory_used;
-  my_atomic_add64(ptr, size);
+  my_atomic_add64_explicit(ptr, size, MY_MEMORY_ORDER_RELAXED);
good!
   }
 }
=== modified file 'sql/mysqld.h'
--- sql/mysqld.h  2014-11-13 08:56:28 +0000
+++ sql/mysqld.h  2014-10-21 12:20:39 +0000
@@ -629,7 +629,7 @@ inline __attribute__((warn_unused_result
 {
   query_id_t id;
   my_atomic_rwlock_wrlock(&global_query_id_lock);
-  id= my_atomic_add64(&global_query_id, 1);
+  id= my_atomic_add64_explicit(&global_query_id, 1, MY_MEMORY_ORDER_RELAXED);
I believe this is fine
   my_atomic_rwlock_wrunlock(&global_query_id_lock);
   return (id);
 }

=== modified file 'sql/table_cache.cc'
--- sql/table_cache.cc  2014-06-10 18:20:33 +0000
+++ sql/table_cache.cc  2014-10-21 12:20:39 +0000
@@ -139,7 +139,7 @@ uint tc_records(void)
 {
   uint count;
   my_atomic_rwlock_rdlock(&LOCK_tdc_atomics);
-  count= my_atomic_load32(&tc_count);
+  count= my_atomic_load32_explicit(&tc_count, MY_MEMORY_ORDER_RELAXED);
I'm not sure about this file. You should know better whether it's safe to use relaxed memory ordering here.
It should be Ok. This counter should not have strict memory ordering
constraints.

Thanks,
Sergey
Hi, Sergey!

On Nov 27, Sergey Vojtovich wrote:
I don't particularly like that solution. Could you try to switch to lock-free hash now?
I'll try to do the iterator asap
This is the change that allowed us to break through 27kTPS to 33kTPS. I don't like it either. My raw lf-hash prototype shows similar results.
I think we badly need to have this bottleneck solved in 10.1 one way or another. So my question is: do you feel like we will be able to implement this all on time for 10.1?
I've pushed the iterator in the bb-lf-iterator branch and reassigned the
issue to you. Please take a look; if you cannot use the API, I could try to
change it.

But note, because of the lock-free nature of the algorithm, the iterator
will not see a snapshot of the hash: it might miss some elements that were
added at about the same time. It might also see some elements twice.
+  struct st_locks
+  {
+    mysql_rwlock_t lock;
+    char pad[128];
In these cases I usually do: pad[128 - sizeof(mysql_rwlock_t)];

Hmm... strange that I didn't define a macro for this, I thought I did.
Your suggestion guarantees that st_locks will be at least 128 bytes. But
these 128 bytes may be split across 2 cache lines. Or does malloc return a
pointer aligned to the cache line size?
No, it doesn't, as far as I know. But I only care that different locks go
into different cache lines; they don't need to be aligned to a 128-byte
boundary.

Regards,
Sergei
Hi Sergei,

On Fri, Nov 28, 2014 at 12:00:49AM +0100, Sergei Golubchik wrote:
Hi, Sergey!
On Nov 27, Sergey Vojtovich wrote:
I don't particularly like that solution. Could you try to switch to lock-free hash now?
I'll try to do the iterator asap
This is the change that allowed us to break through 27kTPS to 33kTPS. I don't like it either. My raw lf-hash prototype shows similar results.
I think we badly need to have this bottleneck solved in 10.1 one way or another. So my question is: do you feel like we will be able to implement this all on time for 10.1?
I've pushed the iterator in the bb-lf-iterator branch and reassigned the
issue to you. Please take a look; if you cannot use the API, I could try to
change it.

API is fine, thanks!
But note, because of the lock-free nature of the algorithm, the iterator
will not see a snapshot of the hash: it might miss some elements that were
added at about the same time. It might also see some elements twice.
Eh, that's what I was afraid of. Not seeing elements that were added at
about the same time should always be acceptable. Can't it miss elements
that were in the list before iteration started (because it has moved to a
different location)? The latter is unacceptable in many cases.

But let's see:
- print_cached_tables() - old debug stuff, remove it?
- I_S.OPEN_TABLES - seeing elements twice is not good, use PFS?
- close_cached_tables() - seeing elements twice is not a problem.
- close_cached_connection_tables() - seeing elements twice is not a problem.
- tc_purge() - seeing elements twice is not a problem.
- tc_add_table() - seeing elements twice is not a problem.
- mdl_iterate() - seeing elements twice is not good, use PFS?
+  struct st_locks
+  {
+    mysql_rwlock_t lock;
+    char pad[128];
In these cases I usually do: pad[128 - sizeof(mysql_rwlock_t)];

Hmm... strange that I didn't define a macro for this, I thought I did.
Your suggestion guarantees that st_locks will be at least 128 bytes. But
these 128 bytes may be split across 2 cache lines. Or does malloc return a
pointer aligned to the cache line size?
No, it doesn't, as far as I know. But I only care that different locks go
into different cache lines; they don't need to be aligned to a 128-byte
boundary.
What about this:

sizeof(mysql_rwlock_t)= 64 (just rough estimation)
sizeof(pad[128 - sizeof(mysql_rwlock_t)])= 64

It gives:

+---------------+---------------+-------------+---------------+---------------+
| rw_lock_t 32b | rw_lock_t 32b | padding 64b | rw_lock_t 32b | rw_lock_t 32b |
+---------------+---------------+-------------+---------------+---------------+
Cache line 128b |               Cache line 128b               |Cache line 128b
----------------+---------------------------------------------+----------------

Thanks,
Sergey
Hi, Sergey!

On Nov 28, Sergey Vojtovich wrote:
Eh, that's what I was afraid of. Not seeing elements that were added at about the same time should always be acceptable. Can't it miss elements that were in the list before iteration started (because it has moved to a different location)? The latter is unacceptable in many cases.
No, in this hash implementation elements are never moved in the list. It's
pretty difficult to move elements lock-free; that's why it wasn't easy to
create a well-performing lock-free hash. And this one works around the
problem by not moving elements at all.
But let's see:
- print_cached_tables() - old debug stuff, remove it?
- I_S.OPEN_TABLES - seeing elements twice is not good, use PFS?
Duplicates can be removed on the upper layer. Like, I_S data are inserted into a temp table anyway, so a UNIQUE key would solve the issue of duplicates. Or a Unique object.
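Just to illustrate the idea with a generic sketch (not the actual lf-hash
iterator API, and not necessarily how I_S would do it), the per-element
callback could simply skip pointers it has already reported:

  #include <set>

  /* Remember which elements were already reported, so an element returned
     twice by a lock-free iterator is only processed once. */
  struct Dedup
  {
    std::set<const void*> seen;
  };

  /* returns true if the element was processed, false if it was a duplicate */
  static bool report_once(Dedup *ctx, const void *element)
  {
    if (!ctx->seen.insert(element).second)
      return false;                  /* duplicate from the iterator: skip it */
    /* ... fill the I_S row / do the real per-element work here ... */
    return true;
  }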
- close_cached_tables() - seeing elements twice is not a problem.
- close_cached_connection_tables() - seeing elements twice is not a problem.
- tc_purge() - seeing elements twice is not a problem.
- tc_add_table() - seeing elements twice is not a problem.
- mdl_iterate() - seeing elements twice is not good, use PFS?
mdl_iterate() - is it for metadata_lock_info plugin? Same answer then.
What about this:

sizeof(mysql_rwlock_t)= 64 (just rough estimation)
sizeof(pad[128 - sizeof(mysql_rwlock_t)])= 64
It gives:

+---------------+---------------+-------------+---------------+---------------+
| rw_lock_t 32b | rw_lock_t 32b | padding 64b | rw_lock_t 32b | rw_lock_t 32b |
+---------------+---------------+-------------+---------------+---------------+
Cache line 128b |               Cache line 128b               |Cache line 128b
----------------+---------------------------------------------+----------------
Okay, indeed. You're right.

Regards,
Sergei
Hi Sergei,

On Fri, Nov 28, 2014 at 10:35:09AM +0100, Sergei Golubchik wrote:
Hi, Sergey!
On Nov 28, Sergey Vojtovich wrote:
Eh, that's what I was afraid of. Not seeing elements that were added at about the same time should always be acceptable. Can't it miss elements that were in the list before iteration started (because it has moved to a different location)? The latter is unacceptable in many cases.
No, in this hash implementation elements are never moved in the list. It's
pretty difficult to move elements lock-free; that's why it wasn't easy to
create a well-performing lock-free hash. And this one works around the
problem by not moving elements at all.

That's good.
But let's see:
- print_cached_tables() - old debug stuff, remove it?
- I_S.OPEN_TABLES - seeing elements twice is not good, use PFS?
Duplicates can be removed on the upper layer. Like, I_S data are inserted into a temp table anyway, so a UNIQUE key would solve the issue of duplicates. Or a Unique object.
Ok, it is acceptable. This all makes your hash iterator good for MDL and
table cache.

Thanks,
Sergey