Re: [Maria-developers] [Commits] Rev 4367: MDEV-7026: Race in InnoDB/XtraDB mutex implementation can stall or hang the server. in http://bazaar.launchpad.net/~maria-captains/maria/5.5
At http://bazaar.launchpad.net/~maria-captains/maria/5.5
------------------------------------------------------------ revno: 4367 revision-id: knielsen@knielsen-hq.org-20141119125646-njq02h7irmu9s71b parent: sergii@pisem.net-20141118231952-fk3o69075dn3ravv committer: Kristian Nielsen <knielsen@knielsen-hq.org> branch nick: mariadb-5.5 timestamp: Wed 2014-11-19 13:56:46 +0100 message: MDEV-7026: Race in InnoDB/XtraDB mutex implementation can stall or hang the server.
The bug was that full memory barrier was missing in the code that ensures that a waiter on an InnoDB mutex will not go to sleep unless it is guaranteed to be woken up again by another thread currently holding the mutex. This made possible a race where a thread could get stuck waiting for a mutex that is in fact no longer locked. If that thread was also holding other critical locks, this could stall the entire server. There is an error monitor thread than can break the stall, it runs about once per second. But if the error monitor thread itself got stuck or was not running, then the entire server could hang infinitely.
This was introduced on i386/amd64 platforms in 5.5.40 and 10.0.13 by an incorrect patch that tried to fix the similar problem for PowerPC. JFYI: Strictly speaking that "incorrect patch" attempted to fix different
Hi Kristian, thanks for taking over this bug! I believe this is the patch that was pushed into 5.5. Unfortunately I think there is still one potential for a hang left (details below). :( Since InnoDB and XtraDB changes are identical I'll comment only XtraDB changes. So you removed os_isync and implemented os_mb instead. That's good! You also removed a bunch of os_wmb/os_rmb calls. That's good too. And you kept RELEASE barrier for mutex_exit() on PPC64. Very good! I will also comment things that I don't really like, but they're just for your information (no action required). On Wed, Nov 19, 2014 at 01:56:46PM +0100, knielsen@knielsen-hq.org wrote: problem (read carefully): it added RELEASE barrier between "stores inside mutex" and "lock_word= 0" while this patch adds full barrier between "lock_word= 0" and "if (waiters)".
This commit reverts the incorrect PowerPC patch, and instead implements a fix for PowerPC that does not change i386/amd64 behaviour, making PowerPC work similarly to i386/amd64.
=== modified file 'storage/xtradb/include/os0sync.h' --- a/storage/xtradb/include/os0sync.h 2014-09-08 15:10:48 +0000 +++ b/storage/xtradb/include/os0sync.h 2014-11-19 12:56:46 +0000 @@ -314,11 +331,28 @@ amount of increment. */ /**********************************************************//** Returns the old value of *ptr, atomically sets *ptr to new_val */
-# define os_atomic_test_and_set_byte(ptr, new_val) \ +#ifdef __powerpc__ +/* + os_atomic_test_and_set_byte_release() should imply a release barrier before + setting, and a full barrier after. But __sync_lock_test_and_set() is only + documented as an aquire barrier. So on PowerPC we need to add the full + barrier explicitly. */ +# define os_atomic_test_and_set_byte_release(ptr, new_val) \ + do { __sync_lock_release(ptr); \ + __sync_synchronize(); } while (0) +#else +/* + On x86, __sync_lock_test_and_set() happens to be full barrier, due to + LOCK prefix. +*/ +# define os_atomic_test_and_set_byte_release(ptr, new_val) \ + __sync_lock_test_and_set(ptr, (byte) new_val) +#endif JFYI: "os_atomic_test_and_set_byte_release" sounds odd to me. Look at gcc manual: https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html ... This builtin is not a full barrier, but rather an acquire barrier. This means
...skip... that references after the builtin cannot move to (or be speculated to) before the builtin, but previous memory stores may not be globally visible yet, and previous memory loads may not yet be satisfied. ... You name RELEASE something that is acqually ACQUIRE. So why didn't you like "os_atomic_lock_release_byte" which is RELASE btw: ... This builtin is not a full barrier, but rather a release barrier. This means that all previous memory stores are globally visible, and all previous memory loads have been satisfied, but following memory reads are not prevented from being speculated to before the barrier. ...
+/* + os_atomic_test_and_set_byte_acquire() is a full memory barrier on x86. But + in general, just an aquire barrier should be sufficient. */ +# define os_atomic_test_and_set_byte_acquire(ptr, new_val) \ __sync_lock_test_and_set(ptr, (byte) new_val) !!! The problem is here!!! Strictly speaking not here, but in mutex_spin_wait(): ... mutex_set_waiters(mutex, 1);
=== modified file 'storage/xtradb/include/sync0sync.ic' --- a/storage/xtradb/include/sync0sync.ic 2014-08-29 12:02:46 +0000 +++ b/storage/xtradb/include/sync0sync.ic 2014-11-19 12:56:46 +0000 @@ -80,11 +80,11 @@ mutex_test_and_set( mutex_t* mutex) /*!< in: mutex */ { #if defined(HAVE_ATOMIC_BUILTINS) - return(os_atomic_test_and_set_byte(&mutex->lock_word, 1)); + return(os_atomic_test_and_set_byte_acquire(&mutex->lock_word, 1)); #else ibool ret;
- ret = os_fast_mutex_trylock(&(mutex->os_fast_mutex)); + ret = os_fast_mutex_trylock_full_barrier(&(mutex->os_fast_mutex));
if (ret == 0) { /* We check that os_fast_mutex_trylock does not leak JFYI: This is probably the most funny thing. Here we're in the "#else" branch of "#if defined(HAVE_ATOMIC_BUILTINS)". E.g. if we compile with gcc and enter this branch it means: sync builtins are not avialble (gcc atomic builtins are very
/* Try to reserve still a few times */ for (i = 0; i < 4; i++) { if (mutex_test_and_set(mutex) == 0) { ... We'll get stuck if "waiters= 1" will become visible after "if (mutex_test_and_set(mutex))". I don't see anything preventing this reorder. ...skip... likely not available too). Now you say "os_fast_mutex_trylock_full_barrier" which does os_mb. But since neither sync bultins nor atomic builtins are available, os_mb is basically no-op. This may be different on other platforms, but to my knowledge memory barriers and atomic operations normally come together. Another thing: did you try compiling with this branch enabled? IIRC it didn't compile even berfore we started fixing it. ...skip... Regards, Sergey
Sergey Vojtovich <svoj@mariadb.org> writes:
+/* + os_atomic_test_and_set_byte_acquire() is a full memory barrier on x86. But + in general, just an aquire barrier should be sufficient. */ +# define os_atomic_test_and_set_byte_acquire(ptr, new_val) \ __sync_lock_test_and_set(ptr, (byte) new_val) !!! The problem is here!!! Strictly speaking not here, but in mutex_spin_wait(): ... mutex_set_waiters(mutex, 1);
/* Try to reserve still a few times */ for (i = 0; i < 4; i++) { if (mutex_test_and_set(mutex) == 0) { ... We'll get stuck if "waiters= 1" will become visible after "if (mutex_test_and_set(mutex))". I don't see anything preventing this reorder.
1. If INNODB_RW_LOCKS_USE_ATOMICS is set, then mytex_set_waiters() is os_compare_and_swap_ulint() which is __sync_bool_compare_and_swap(). This is documented as a full barrier. So it is not possible for "waiters=1" to become visible after "if (mutex_test_and_set(mutex))". 2. If INNODB_RW_LOCKS_USE_ATOMICS is _not_ set, but HAVE_ATOMIC_BUILTINS _is_ set, then os_atomic_test_and_set_byte_acquire() is used in mutex_test_and_set(). This is a full barrier on x86 (despite using the wrong GCC primitive), so should be ok there. On PowerPC it is probably only an acquire barrier, so you are probably right that the "waiters=1" may only become visible after "if (mutex_test_and_set(mutex))". If I understand correctly, this was the same before my patch also. 3. If neither INNODB_RW_LOCKS_USE_ATOMICS nor HAVE_ATOMIC_BUILTINS is set, then mutex_test_and_set() uses os_fast_mutex_trylock_full_barrier(), which again prevents "waiter=1" to be come visible too late, at least on x86. You may be right that the os_mb() in os_fast_mutex_trylock_full_barrier() is no-op in this case, so that PowerPC has only acquire barrier in trylock(), which is insufficent. Again, if I understand correctly, this is unchanged by my patch. So as I understand it, case (1) is correct for both x86 and PowerPC. Cases (2) and (3) should be correct for x86 also. (2) and (3) may be incorrect for PowerPC, but then they would have been incorrect before also, my patch does not change it. As you mentioned, it seems likely that problematic cases (2) and (3) do not occur at all in our builds, only working case (1) is used. If so, it seems we do not actually have the problem in current binaries. Maybe we could put in #ifdef __powerpc__ #error not supported compiler #endif for cases (2) and (3) to ensure this. Just to make it clear, I find it absolutely horrible that the InnoDB mutex code requires completely different reasoning to explain its correctness, depending on #ifdefs. It is almost impossible to be sure that one really understands how (and if) it works. That is why I find it so scary to change it in a stable release, and suggested something that does not change the existing, working x86 implementation.
Another thing: did you try compiling with this branch enabled? IIRC it didn't compile even berfore we started fixing it.
I did not try compiling anything except whatever is the default. - Kristian.
Hi Kristian, On Thu, Nov 20, 2014 at 02:28:54PM +0100, Kristian Nielsen wrote:
Sergey Vojtovich <svoj@mariadb.org> writes:
+/* + os_atomic_test_and_set_byte_acquire() is a full memory barrier on x86. But + in general, just an aquire barrier should be sufficient. */ +# define os_atomic_test_and_set_byte_acquire(ptr, new_val) \ __sync_lock_test_and_set(ptr, (byte) new_val) !!! The problem is here!!! Strictly speaking not here, but in mutex_spin_wait(): ... mutex_set_waiters(mutex, 1);
/* Try to reserve still a few times */ for (i = 0; i < 4; i++) { if (mutex_test_and_set(mutex) == 0) { ... We'll get stuck if "waiters= 1" will become visible after "if (mutex_test_and_set(mutex))". I don't see anything preventing this reorder.
1. If INNODB_RW_LOCKS_USE_ATOMICS is set, then mytex_set_waiters() is os_compare_and_swap_ulint() which is __sync_bool_compare_and_swap(). This is documented as a full barrier. So it is not possible for "waiters=1" to become visible after "if (mutex_test_and_set(mutex))".
2. If INNODB_RW_LOCKS_USE_ATOMICS is _not_ set, but HAVE_ATOMIC_BUILTINS _is_ set, then os_atomic_test_and_set_byte_acquire() is used in mutex_test_and_set(). This is a full barrier on x86 (despite using the wrong GCC primitive), so should be ok there. On PowerPC it is probably only an acquire barrier, so you are probably right that the "waiters=1" may only become visible after "if (mutex_test_and_set(mutex))". If I understand correctly, this was the same before my patch also.
3. If neither INNODB_RW_LOCKS_USE_ATOMICS nor HAVE_ATOMIC_BUILTINS is set, then mutex_test_and_set() uses os_fast_mutex_trylock_full_barrier(), which again prevents "waiter=1" to be come visible too late, at least on x86. You may be right that the os_mb() in os_fast_mutex_trylock_full_barrier() is no-op in this case, so that PowerPC has only acquire barrier in trylock(), which is insufficent. Again, if I understand correctly, this is unchanged by my patch.
So as I understand it, case (1) is correct for both x86 and PowerPC. Cases (2) and (3) should be correct for x86 also. (2) and (3) may be incorrect for PowerPC, but then they would have been incorrect before also, my patch does not change it.
As you mentioned, it seems likely that problematic cases (2) and (3) do not occur at all in our builds, only working case (1) is used. If so, it seems we do not actually have the problem in current binaries. Maybe we could put in
#ifdef __powerpc__ #error not supported compiler #endif
for cases (2) and (3) to ensure this.
Just to make it clear, I find it absolutely horrible that the InnoDB mutex code requires completely different reasoning to explain its correctness, depending on #ifdefs. It is almost impossible to be sure that one really understands how (and if) it works. That is why I find it so scary to change it in a stable release, and suggested something that does not change the existing, working x86 implementation. Ok so you found extra inconsistencies in this code. Sorry that I wasn't clear enough, but what I meant was...
mutex_set_waiters() calls __sync_bool_compare_and_swap() only in 5.5/XtraDB. In this case it is indeed Ok. But 5.5/InnoDB, 10.0/InnoDB, 10.0/XtraDB just do waiters= 1. I don't see this fixed. Regarding INNODB_RW_LOCKS_USE_ATOMICS - it is abused in mutex code: as it says it is supposed to control rwlocks only. Regards, Sergey
participants (2)
-
Kristian Nielsen
-
Sergey Vojtovich