At http://bazaar.launchpad.net/~maria-captains/maria/5.5
------------------------------------------------------------ revno: 4367 revision-id: knielsen@knielsen-hq.org-20141119125646-njq02h7irmu9s71b parent: sergii@pisem.net-20141118231952-fk3o69075dn3ravv committer: Kristian Nielsen <knielsen@knielsen-hq.org> branch nick: mariadb-5.5 timestamp: Wed 2014-11-19 13:56:46 +0100 message: MDEV-7026: Race in InnoDB/XtraDB mutex implementation can stall or hang the server.
The bug was that full memory barrier was missing in the code that ensures that a waiter on an InnoDB mutex will not go to sleep unless it is guaranteed to be woken up again by another thread currently holding the mutex. This made possible a race where a thread could get stuck waiting for a mutex that is in fact no longer locked. If that thread was also holding other critical locks, this could stall the entire server. There is an error monitor thread than can break the stall, it runs about once per second. But if the error monitor thread itself got stuck or was not running, then the entire server could hang infinitely.
This was introduced on i386/amd64 platforms in 5.5.40 and 10.0.13 by an incorrect patch that tried to fix the similar problem for PowerPC. JFYI: Strictly speaking that "incorrect patch" attempted to fix different
Hi Kristian, thanks for taking over this bug! I believe this is the patch that was pushed into 5.5. Unfortunately I think there is still one potential for a hang left (details below). :( Since InnoDB and XtraDB changes are identical I'll comment only XtraDB changes. So you removed os_isync and implemented os_mb instead. That's good! You also removed a bunch of os_wmb/os_rmb calls. That's good too. And you kept RELEASE barrier for mutex_exit() on PPC64. Very good! I will also comment things that I don't really like, but they're just for your information (no action required). On Wed, Nov 19, 2014 at 01:56:46PM +0100, knielsen@knielsen-hq.org wrote: problem (read carefully): it added RELEASE barrier between "stores inside mutex" and "lock_word= 0" while this patch adds full barrier between "lock_word= 0" and "if (waiters)".
This commit reverts the incorrect PowerPC patch, and instead implements a fix for PowerPC that does not change i386/amd64 behaviour, making PowerPC work similarly to i386/amd64.
=== modified file 'storage/xtradb/include/os0sync.h' --- a/storage/xtradb/include/os0sync.h 2014-09-08 15:10:48 +0000 +++ b/storage/xtradb/include/os0sync.h 2014-11-19 12:56:46 +0000 @@ -314,11 +331,28 @@ amount of increment. */ /**********************************************************//** Returns the old value of *ptr, atomically sets *ptr to new_val */
-# define os_atomic_test_and_set_byte(ptr, new_val) \ +#ifdef __powerpc__ +/* + os_atomic_test_and_set_byte_release() should imply a release barrier before + setting, and a full barrier after. But __sync_lock_test_and_set() is only + documented as an aquire barrier. So on PowerPC we need to add the full + barrier explicitly. */ +# define os_atomic_test_and_set_byte_release(ptr, new_val) \ + do { __sync_lock_release(ptr); \ + __sync_synchronize(); } while (0) +#else +/* + On x86, __sync_lock_test_and_set() happens to be full barrier, due to + LOCK prefix. +*/ +# define os_atomic_test_and_set_byte_release(ptr, new_val) \ + __sync_lock_test_and_set(ptr, (byte) new_val) +#endif JFYI: "os_atomic_test_and_set_byte_release" sounds odd to me. Look at gcc manual: https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html ... This builtin is not a full barrier, but rather an acquire barrier. This means
...skip... that references after the builtin cannot move to (or be speculated to) before the builtin, but previous memory stores may not be globally visible yet, and previous memory loads may not yet be satisfied. ... You name RELEASE something that is acqually ACQUIRE. So why didn't you like "os_atomic_lock_release_byte" which is RELASE btw: ... This builtin is not a full barrier, but rather a release barrier. This means that all previous memory stores are globally visible, and all previous memory loads have been satisfied, but following memory reads are not prevented from being speculated to before the barrier. ...
+/* + os_atomic_test_and_set_byte_acquire() is a full memory barrier on x86. But + in general, just an aquire barrier should be sufficient. */ +# define os_atomic_test_and_set_byte_acquire(ptr, new_val) \ __sync_lock_test_and_set(ptr, (byte) new_val) !!! The problem is here!!! Strictly speaking not here, but in mutex_spin_wait(): ... mutex_set_waiters(mutex, 1);
=== modified file 'storage/xtradb/include/sync0sync.ic' --- a/storage/xtradb/include/sync0sync.ic 2014-08-29 12:02:46 +0000 +++ b/storage/xtradb/include/sync0sync.ic 2014-11-19 12:56:46 +0000 @@ -80,11 +80,11 @@ mutex_test_and_set( mutex_t* mutex) /*!< in: mutex */ { #if defined(HAVE_ATOMIC_BUILTINS) - return(os_atomic_test_and_set_byte(&mutex->lock_word, 1)); + return(os_atomic_test_and_set_byte_acquire(&mutex->lock_word, 1)); #else ibool ret;
- ret = os_fast_mutex_trylock(&(mutex->os_fast_mutex)); + ret = os_fast_mutex_trylock_full_barrier(&(mutex->os_fast_mutex));
if (ret == 0) { /* We check that os_fast_mutex_trylock does not leak JFYI: This is probably the most funny thing. Here we're in the "#else" branch of "#if defined(HAVE_ATOMIC_BUILTINS)". E.g. if we compile with gcc and enter this branch it means: sync builtins are not avialble (gcc atomic builtins are very
/* Try to reserve still a few times */ for (i = 0; i < 4; i++) { if (mutex_test_and_set(mutex) == 0) { ... We'll get stuck if "waiters= 1" will become visible after "if (mutex_test_and_set(mutex))". I don't see anything preventing this reorder. ...skip... likely not available too). Now you say "os_fast_mutex_trylock_full_barrier" which does os_mb. But since neither sync bultins nor atomic builtins are available, os_mb is basically no-op. This may be different on other platforms, but to my knowledge memory barriers and atomic operations normally come together. Another thing: did you try compiling with this branch enabled? IIRC it didn't compile even berfore we started fixing it. ...skip... Regards, Sergey