Hi Kristian, thanks for you feedback. Answers inline. On Fri, Nov 14, 2014 at 11:23:13AM +0100, Kristian Nielsen wrote:
Sergey Vojtovich <svoj@mariadb.org> writes:
revno: 4346 revision-id: svoj@mariadb.org-20141112114907-thnro6e3kg2ofdsw committer: Sergey Vojtovich <svoj@mariadb.org> timestamp: Wed 2014-11-12 15:49:07 +0400 message: MDEV-7026 - Occasional hang during startup on Power8
=== modified file 'storage/xtradb/sync/sync0sync.c' --- a/storage/xtradb/sync/sync0sync.c 2014-11-03 13:43:44 +0000 +++ b/storage/xtradb/sync/sync0sync.c 2014-11-12 11:49:07 +0000 @@ -467,26 +467,15 @@ mutex_set_waiters( mutex_t* mutex, /*!< in: mutex */ ulint n) /*!< in: value to set */ { -#ifdef INNODB_RW_LOCKS_USE_ATOMICS - ut_ad(mutex); - - if (n) { - os_compare_and_swap_ulint(&mutex->waiters, 0, 1); - } else { - os_compare_and_swap_ulint(&mutex->waiters, 1, 0); - } -#else volatile ulint* ptr; /* declared volatile to ensure that the value is stored to memory */ ut_ad(mutex);
ptr = &(mutex->waiters);
- os_wmb; - *ptr = n; /* Here we assume that the write of a single word in memory is atomic */ -#endif + os_sync; }
This code is still substantially different from what it was in 5.5.39 and 10.0.12, right? I think we need to instead revert the code to what it was before introducing the problem.
The InnoDB mutex code clearly does not work by careful design - mutex_reset_lock_word() needs a full barrier but uses what is documented by GCC as only an acquire barrier. Rather, it works by having been tweaked over the years to just happen to work (eg. GCC in fact uses a full barrier on x86).
So changing this to be "correct" is very dangerous in a stable release. It is not possible to ensure correctness by testing, either.
But it is not necessary to change it. Instead, leave the code as it was before, and use #ifdef __powerpc__ to ensure that it will work also on PowerPC.
For example, something like this (just a sketch, I didn't check exactly how the code should look):
#ifdef __powerpc__ /* InnoDB mutex implementation needs full memory barrier for this operation */ # define os_atomic_test_and_set_byte(ptr, new_val) \ do { sync(); __sync_lock_test_and_set(ptr, (byte) new_val); } while (0) #else # define os_atomic_test_and_set_byte(ptr, new_val) \ __sync_lock_test_and_set(ptr, (byte) new_val) #endif
Then we leave stable releases 5.5 and 10.0 unchanged, reducing the risk of introducing another serious breakage. And we can still fix PowerPC to work correctly.
And then perhaps in 10.1 we can do refactoring of the use of memory barriers in InnoDB, if we really want. (Just in case it's not obvious, yes, I _really_ have the InnoDB mutex code).
Remember, the 5.5 and 10.0 minor releases appear as security updates in distros, so it is very important to minimise the risk of introducing serious regressions. I think your idea of fixing this is acceptable. I'll send another patch soon.
Since isync is not a memory barrier and lwsync doesn't guarantee StoreLoad order, the only option we have is full memory barrier.
It's not a problem to have a full barrier here, I think. Because at this point, the waiter has already gone through a lot of spin loops, and now it is going to take a global mutex on the sync wait array, scan the array, and go to sleep with a context switch. One less full barrier is not going to help a lot. To improve this contended case of the InnoDB mutexes, more radical changes are needed.
Look at the cset comment: every mutex_exit() has to issue full memory barrier unconditionally! But I benchmarked this patch and it worked slightly faster on Power8. I suppose because stale threads don't have to wait a second till wakeup from error monitor thread.
InnoDB/XtraDB was getting stuck during server startup. In particular (but not limited to) while creating doublewrite buffer.
The stuck during startup is kind of the minor part of the bug. The major part is stalling the server during operation. (The hang during startup was just how the problem was detected in Buildbot).
Better clarify in the commit message that this fixes a bug that could cause the entire server to stall (or even hang completely if error monitor thread itself got stuck). This will help people who experienced the problem and look at the changelog.
A few quotes from a different thread: ... Kristian> 2. In a user hang I investigated in detail, a thread was waiting on Kristian> log_sys->mutex, but no other thread was holding that mutex (consistent with Kristian> lost wakeup). The server error monitor thread was itself stuck on a mutex, Kristian> which would prevent sync_arr_wake_threads_if_sema_free() from hiding the Kristian> problem. me> Strange... Monty should have fixed this. Error monitor thread should call me> log_get_lsn_nowait(), which basically does trylock. Do you happen to have call me> trace? Kristian> This investigation was some time ago (maybe 1-2 months). It seems likely that Kristian> this was a version before Monty's log_get_lsn_nowait() (eg. 10.0.14 has it, Kristian> but not 10.0.13). ... me> According to history ACQUIRE -> RELEASE fix appeared in 10.0.13 and fix for me> log_get_lsn() appeared in 10.0.14. Both fixes appeared similtaneously in 5.5.40. me> me> So runtime hangs should be solved both in 5.5 and 10.0. This leaves hangs during me> startup, which are unfortunate but not as critical as runtime hangs. me> me> Are there any other known hangs? Kristian> If you are going to argue for deliberately keeping broken mutex_enter() / Kristian> mutex_exit() in MariaDB, I will stay out of the discussion. I was only trying Kristian> to get Buildbot green so that I can push my completely unrelated fixes for Kristian> 10.0. ... Stating that this patch fixes run-time hangs that I'm not aware of is kind of strange. So I repeat my question: Are there any other known hangs?
It happens because mutex_exit() didn't wake up waiters of affected mutex. This is kind of acceptable because error monitor thread is supposed to wake up such waiters. But since error monitor thread hasn't been started that early, nobody did inform waiting threads that mutex was released.
I do not think the error monitor is "supposed to wake up such waiters". In fact, it seems clear that the error monitor wakeup is _not_ used, at least not in commonly used binaries. Because it just never worked - it itself needs to take mutexes, which will prevent it from doing the wakeup. So if it was needed, someone would have noticed this problem. The symptom is _very_ hard to miss, as the whole server gets stuck hard, unable to do anything.
It really is not acceptable to have a mutex implementation that can miss wakeups. This can stall the entire server. Even if the error monitor resolves the stall after 1 second, that is still unacceptable for many applications.
Could you suggest better wording for cset comment? Thanks, Sergey