Sergey Vojtovich <svoj@mariadb.org> writes:
revno: 4346 revision-id: svoj@mariadb.org-20141112114907-thnro6e3kg2ofdsw committer: Sergey Vojtovich <svoj@mariadb.org> timestamp: Wed 2014-11-12 15:49:07 +0400 message: MDEV-7026 - Occasional hang during startup on Power8
=== modified file 'storage/xtradb/sync/sync0sync.c' --- a/storage/xtradb/sync/sync0sync.c 2014-11-03 13:43:44 +0000 +++ b/storage/xtradb/sync/sync0sync.c 2014-11-12 11:49:07 +0000 @@ -467,26 +467,15 @@ mutex_set_waiters( mutex_t* mutex, /*!< in: mutex */ ulint n) /*!< in: value to set */ { -#ifdef INNODB_RW_LOCKS_USE_ATOMICS - ut_ad(mutex); - - if (n) { - os_compare_and_swap_ulint(&mutex->waiters, 0, 1); - } else { - os_compare_and_swap_ulint(&mutex->waiters, 1, 0); - } -#else volatile ulint* ptr; /* declared volatile to ensure that the value is stored to memory */ ut_ad(mutex);
ptr = &(mutex->waiters);
- os_wmb; - *ptr = n; /* Here we assume that the write of a single word in memory is atomic */ -#endif + os_sync; }
This code is still substantially different from what it was in 5.5.39 and 10.0.12, right? I think we need to instead revert the code to what it was before introducing the problem. The InnoDB mutex code clearly does not work by careful design - mutex_reset_lock_word() needs a full barrier but uses what is documented by GCC as only an acquire barrier. Rather, it works by having been tweaked over the years to just happen to work (eg. GCC in fact uses a full barrier on x86). So changing this to be "correct" is very dangerous in a stable release. It is not possible to ensure correctness by testing, either. But it is not necessary to change it. Instead, leave the code as it was before, and use #ifdef __powerpc__ to ensure that it will work also on PowerPC. For example, something like this (just a sketch, I didn't check exactly how the code should look): #ifdef __powerpc__ /* InnoDB mutex implementation needs full memory barrier for this operation */ # define os_atomic_test_and_set_byte(ptr, new_val) \ do { sync(); __sync_lock_test_and_set(ptr, (byte) new_val); } while (0) #else # define os_atomic_test_and_set_byte(ptr, new_val) \ __sync_lock_test_and_set(ptr, (byte) new_val) #endif Then we leave stable releases 5.5 and 10.0 unchanged, reducing the risk of introducing another serious breakage. And we can still fix PowerPC to work correctly. And then perhaps in 10.1 we can do refactoring of the use of memory barriers in InnoDB, if we really want. (Just in case it's not obvious, yes, I _really_ have the InnoDB mutex code). Remember, the 5.5 and 10.0 minor releases appear as security updates in distros, so it is very important to minimise the risk of introducing serious regressions.
Since isync is not a memory barrier and lwsync doesn't guarantee StoreLoad order, the only option we have is full memory barrier.
It's not a problem to have a full barrier here, I think. Because at this point, the waiter has already gone through a lot of spin loops, and now it is going to take a global mutex on the sync wait array, scan the array, and go to sleep with a context switch. One less full barrier is not going to help a lot. To improve this contended case of the InnoDB mutexes, more radical changes are needed.
InnoDB/XtraDB was getting stuck during server startup. In particular (but not limited to) while creating doublewrite buffer.
The stuck during startup is kind of the minor part of the bug. The major part is stalling the server during operation. (The hang during startup was just how the problem was detected in Buildbot). Better clarify in the commit message that this fixes a bug that could cause the entire server to stall (or even hang completely if error monitor thread itself got stuck). This will help people who experienced the problem and look at the changelog.
It happens because mutex_exit() didn't wake up waiters of affected mutex. This is kind of acceptable because error monitor thread is supposed to wake up such waiters. But since error monitor thread hasn't been started that early, nobody did inform waiting threads that mutex was released.
I do not think the error monitor is "supposed to wake up such waiters". In fact, it seems clear that the error monitor wakeup is _not_ used, at least not in commonly used binaries. Because it just never worked - it itself needs to take mutexes, which will prevent it from doing the wakeup. So if it was needed, someone would have noticed this problem. The symptom is _very_ hard to miss, as the whole server gets stuck hard, unable to do anything. It really is not acceptable to have a mutex implementation that can miss wakeups. This can stall the entire server. Even if the error monitor resolves the stall after 1 second, that is still unacceptable for many applications. Hope this helps, - Kristian.