[Maria-developers] Memory barrier problem in InnoDB/XtraDB mutex implementation causing hangs
I have been debugging some test failures seen in Buildbot:

    https://mariadb.atlassian.net/browse/MDEV-7026

After some debugging, this is what I think is going on. I will refer to MariaDB 5.5 sources, though the problem is also in 10.0.

InnoDB/XtraDB mutexes are taken in mutex_enter_func() and unlocked in mutex_exit_func(). Simplifying for clarity, here are the relevant parts of the logic:

ENTER:

    # First spin wait, hoping to get the mutex without yielding.
    os_compare_and_swap_ulint(&mutex->waiters, 0, 1);
    if (os_atomic_test_and_set_byte(&mutex->lock_word, 1))
        sync_array_wait_event()

RELEASE:

    os_atomic_lock_release_byte(&mutex->lock_word);
    os_rmb;
    waiters = *(volatile *)&mutex->waiters;
    if (waiters)
        mutex_signal_object(mutex);

The interesting point here is how to ensure that we do not lose the wakeup of a waiting thread when the holding thread releases the mutex.

The idea is that the waiting thread first sets mutex->waiters=1. Then it checks if the mutex is free - if not, it goes to sleep. The thread that releases the lock first marks the mutex free. Then it checks if there are any waiters, and if so wakes them up.

For this to work, we need to ensure that the setting of waiters=1 has completed before the load of lock_word can start - a full memory barrier. Likewise, we need the setting of lock_word to 0 to complete before the load of mutex->waiters - again a full memory barrier. Otherwise we can get something like this:

    ENTER                              RELEASE
    Start store to mutex->waiters
                                       Start store to lock_word
    Load mutex->lock_word
                                       Store to lock_word becomes visible
                                       Load mutex->waiters
    Store to waiters becomes visible
    Check lock word==1
                                       Check waiters==0, skip wakeup
    Sleep (forever)

Or this:

    ENTER                              RELEASE
                                       Start store to lock_word
    Start store to mutex->waiters
                                       Load mutex->waiters
    Store to waiters becomes visible
    Load mutex->lock_word
                                       Store to lock_word becomes visible
    Check lock word==1
    Sleep (forever)
                                       Check waiters==0, skip wakeup

(So both full barriers are needed, the one in ENTER to prevent the first example, the one in RELEASE to prevent the second).

It looks like for x86, the full barriers were broken in MariaDB 5.5.40 and 10.0.13.

os_compare_and_swap_ulint() is defined as either __sync_bool_compare_and_swap() or atomic_cas_ulong(), both of which imply a full barrier. So ENTER looks ok.

os_atomic_lock_release_byte() however is now defined as __sync_lock_release(). This is _not_ a full barrier, just a release barrier. So the RELEASE case seems to be incorrect, allowing the second deadlock example to occur.

Previously, RELEASE used os_atomic_test_and_set_byte(), which is __sync_lock_test_and_set(). This is defined in GCC docs as an "acquire" barrier only, which is also wrong, obviously. But it happens to generate an XCHG instruction on x86, which is a full barrier. This is probably why it works on x86 (and probably why it did not work on PowerPC).

Here is a small test program, and the assembly from GCC 5.7 on amd64:

    void test(unsigned *p, unsigned *q)
    {
        while (__sync_lock_test_and_set(q, 1))
            { }
        *p++;
        __sync_lock_release(q);
    }

    test:
            movl    $1, %edx
    .L2:
            movl    %edx, %eax
            xchgl   (%rsi), %eax
            testl   %eax, %eax
            jne     .L2
            movl    $0, (%rsi)
            ret

As can be seen, the __sync_lock_test_and_set() generates a full barrier, but the __sync_lock_release() generates nothing (as this is sufficient for release semantics for the rather strong memory order on x86).

So the InnoDB mutexes look completely broken from MariaDB 5.5.40 and 10.0.13 :-(

This is consistent with the observations in MDEV-7026. A thread is waiting for wakeup on a mutex that is already marked free.

Apparently this is not a new problem. This comment in mutex_exit_func() makes me want to cry:

    /* A problem: we assume that mutex_reset_lock word
    is a memory barrier, that is when we read the waiters
    field next, the read must be serialized in memory
    after the reset. A speculative processor might
    perform the read first, which could leave a waiting
    thread hanging indefinitely.

    Our current solution call every second
    sync_arr_wake_threads_if_sema_free()
    to wake up possible hanging threads if
    they are missed in mutex_signal_object. */

So someone found the mutex implementation broken, but instead of fixing it, they implemented a horrible workaround to try and break any deadlocks that might occur.

(The reason that the deadlocks are not broken in the MDEV-7026 seems to be that it occurs during startup, before starting the srv_error_monitor_thread() that does the sync_arr_wake_threads_if_sema_free()).

Now, what is also interesting is that this seems to explain very nicely the many server lockups we have seen for users that upgrade to recent MariaDB versions. I was doing a very detailed investigation of one such case where the user was able to supply unusually detailed information. I found a thread that was stuck waiting on a mutex, but no other threads seemed to be holding that mutex. This is exactly what would be seen in case of a lost wakeup due to this issue. Furthermore, the sync_arr_wake_threads_if_sema_free() mechanism, as well as the normal InnoDB "long semaphore wait", was itself blocked on that same mutex wait, which explains why they were not effective for the user in that particular case.

So it looks like we need to revert the incorrect patch and release new 5.5.40a and 10.0.15 releases ASAP, to solve these hangs for users. And do a proper patch for PowerPC with proper full memory barriers. (Incidentally, the os_rmb in RELEASE seems to also be part of some PowerPC patch, it also looks completely wrong).

However, more importantly, this issue clearly shows some fundamental problems in our development procedures that we need to fix ASAP.

Just look at it. We have users (and customers!) that suffer from strange hangs. This is their production systems that are down. Support and developers are unable to help them due to the difficulty of debugging rare race conditions in production systems.

And all the time the bug was staring us in the face, easily reproducible in our own Buildbot! But we deliberately chose to ignore those failures. The only thing that matters is "when can you push", and getting more releases out. Never mind what kind of crap is then pushed out to users.

I am deeply ashamed of treating our users like this. And extremely unhappy. I even wrote about this (the importance of not ignoring Buildbot failures) just a few weeks ago. This was completely ignored, I got zero replies, and the weekly calls continued with mindless juggling of Jira numbers and buggy releases.

As a result, our users suffer.

We need to stop this.

Please?

 - Kristian.
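For illustration, here is a minimal sketch of a release path with the full barrier the analysis above calls for: the store marking the mutex free is separated from the load of waiters by a StoreLoad (full) barrier. This is not the actual MariaDB patch; the struct and helper names are invented for the example, and GCC's __sync_synchronize() builtin stands in for whatever full-barrier primitive a port provides.

    struct my_mutex {
        volatile unsigned long lock_word;   /* 1 = locked, 0 = free        */
        volatile unsigned long waiters;     /* 1 = a waiter may be asleep  */
    };

    void my_mutex_wake_waiters(struct my_mutex *m);  /* hypothetical wakeup helper */

    /* Sketch of a correct release path: the store marking the mutex free
       must be globally visible before waiters is loaded, otherwise a
       concurrent ENTER can be missed and its wakeup lost. */
    static void
    my_mutex_release(struct my_mutex *m)
    {
        m->lock_word = 0;           /* mark the mutex free             */
        __sync_synchronize();       /* full (StoreLoad) memory barrier */
        if (m->waiters) {
            my_mutex_wake_waiters(m);
        }
    }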
Hi Kristian, On Thu, Nov 06, 2014 at 03:12:55PM +0100, Kristian Nielsen wrote:
I have been debugging some test failures seen in Buildbot:
https://mariadb.atlassian.net/browse/MDEV-7026
After some debugging, this is what I think is going on. I will refer to MariaDB 5.5 sources, though the problem is also in 10.0.
InnoDB/XtraDB mutexes are taken in mutex_enter_func() and unlocked in mutex_exit_func().
Simplifying for clarity, here are the relevant parts of the logic:
ENTER:
    # First spin wait, hoping to get the mutex without yielding.
    os_compare_and_swap_ulint(&mutex->waiters, 0, 1);

Where did you see "os_compare_and_swap_ulint(&mutex->waiters, 0, 1)"? FWICS waiters of InnoDB mutex are non-atomic loads/stores.

    if (os_atomic_test_and_set_byte(&mutex->lock_word, 1))
        sync_array_wait_event()
RELEASE:
    os_atomic_lock_release_byte(&mutex->lock_word);
    os_rmb;
    waiters = *(volatile *)&mutex->waiters;
    if (waiters)
        mutex_signal_object(mutex);
The interesting point here is how to ensure that we do not lose the wakeup of a waiting thread when the holding thread releases the mutex.
The idea is that the waiting thread first sets mutex->waiters=1. Then it checks if the mutex is free - if not, it goes to sleep.
The thread that releases the lock first marks the mutex free. Then it checks if there are any waiters, and if so wakes them up.
For this to work, we need to ensure that the setting of waiters=1 has completed before the load of lock_word can start - a full memory barrier. Likewise, we need the setting of lock_word to 0 to complete before the load of mutex->waiters - again a full memory barrier. Otherwise we can get something like this:
    ENTER                              RELEASE
    Start store to mutex->waiters
                                       Start store to lock_word
    Load mutex->lock_word
                                       Store to lock_word becomes visible
                                       Load mutex->waiters
    Store to waiters becomes visible
    Check lock word==1
                                       Check waiters==0, skip wakeup
    Sleep (forever)

Or this:

    ENTER                              RELEASE
                                       Start store to lock_word
    Start store to mutex->waiters
                                       Load mutex->waiters
    Store to waiters becomes visible
    Load mutex->lock_word
                                       Store to lock_word becomes visible
    Check lock word==1
    Sleep (forever)
                                       Check waiters==0, skip wakeup

(So both full barriers are needed, the one in ENTER to prevent the first example, the one in RELEASE to prevent the second).

Agree so far.
It looks like for x86, the full barriers were broken in MariaDB 5.5.40 and 10.0.13.
os_compare_and_swap_ulint() is defined as either __sync_bool_compare_and_swap() or atomic_cas_ulong(), both of which implies a full barrier. So ENTER looks ok.
See above: waiters loads/stores are not atomic, so no memory barriers at all.
os_atomic_lock_release_byte() however is now defined as __sync_lock_release(). This is _not_ a full barrier, just a release barrier. So the RELEASE case seems to be incorrect, allowing the second deadlock example to occur.
Previously, RELEASE used os_atomic_test_and_set_byte(), which is __sync_lock_test_and_set(). This is defined in GCC docs as an "aquire" barrier only, which is also wrong, obviously. But it happens to generate an XCHG instruction on x86, which is a full barrier. This is probably why it works on x86 (and probably why it did not work on PowerPC).
Here is a small test program, and the assembly from GCC 5.7 on amd64:
    void test(unsigned *p, unsigned *q)
    {
        while (__sync_lock_test_and_set(q, 1))
            { }
        *p++;
        __sync_lock_release(q);
    }

    test:
            movl    $1, %edx
    .L2:
            movl    %edx, %eax
            xchgl   (%rsi), %eax
            testl   %eax, %eax
            jne     .L2
            movl    $0, (%rsi)
            ret
As can be seen, the __sync_lock_test_and_set() generates a full barrier, but the __sync_lock_release() generates nothing (as this is sufficient for release semantics for the rather strong memory order on x86).
So the InnoDB mutexes look completely broken from MariaDB 5.5.40 and 10.0.13 :-(
This is consistent with the observations in MDEV-7026. A thread is waiting for wakeup on a mutex that is already marked free.
Ok, so you say this hang is now repeatable with x86 because we changed ACQUIRE -> RELEASE. I assume you were actually able to reproduce it on x86.

Note that we also added mysterious os_isync to mutex_exit_func, which is supposed to be among other things ACQUIRE. But on x86 it is a no-op with a comment:

    /* Performance regression was observed at some conditions for Intel
    architecture. Disable memory barrier for Intel architecture for now. */
Apparently this is not a new problem. This comment in mutex_exit_func() makes me want to cry:
/* A problem: we assume that mutex_reset_lock word is a memory barrier, that is when we read the waiters field next, the read must be serialized in memory after the reset. A speculative processor might perform the read first, which could leave a waiting thread hanging indefinitely.
Our current solution call every second sync_arr_wake_threads_if_sema_free() to wake up possible hanging threads if they are missed in mutex_signal_object. */
Hehe, you're not alone who cried about this comment. :)
So someone found the mutex implementation broken, but instead of fixing it, they implemented a horrible workaround to try and break any deadlocks that might occur.
(The reason that the deadlocks are not broken in the MDEV-7026 seems to be that it occurs during startup, before starting the srv_error_monitor_thread() that does the sync_arr_wake_threads_if_sema_free()).
Now, what is also interesting is that this seems to explain very nicely the many server lockups we have seen for users that upgrade to recent MariaDB versions. I was doing a very detailed investigation of one such case where the user was able to supply unusually detailed information. I found a thread that was stuck waiting on a mutex, but no other threads seemed to be holding that mutex. This is exactly what would be seen in case of a lost wakeup due to this issue. Furthermore, the sync_arr_wake_threads_if_sema_free() mechanism, as well as the normal InnoDB "long semaphore wait", was itself blocked on that same mutex wait, which explains why they were not effective for the user in that particular case.
Does it explain server lockups at startup or during normal operations? Could you elaborate the problem you're talking about?
So it looks like we need to revert the incorrect patch and release new 5.5.40a and 10.0.15 releases ASAP, to solve these hangs for users.
And do a proper patch for PowerPC with proper full memory barriers. (Incidentally, the os_rmb in RELEASE seems to also be part of some PowerPC patch, it also looks completely wrong).
I believe reverting this patch is too late: we already claimed compatibility with Power8. I'd vote for putting full memory barrier into mutex_exit_func() and probably remove that ugly hack with sync_arr_wake_threads_if_sema_free. :(

Though a proper solution should never issue full memory barriers for mutex lock/unlock. I suggested a different idea, but that's for 10.1: https://mariadb.atlassian.net/browse/MDEV-6654

That's unfortunate, but we attempted to fix the wrong thing with the wrong patch. I was kind of against this: https://mariadb.atlassian.net/browse/MDEV-6515

<quot>
- mutex_get_waiters() miss acquire memory barrier. This may cause mutex_exit_func() read stale 'waiters' value and be the reason for deadlock. There seem to be a workaround for that: srv_error_monitor_thread() is supposed to wake these stale threads every second. But if that's the case, we don't really need release memory barrier in mutex_set_waiters().
</quot>

s/acquire memory barrier/full memory barrier/
However, more importantly, this issue clearly shows some fundamental problems in our development procedures that we need to fix ASAP.
Just look at it. We have users (and customers!) that suffer from strange hangs. This is their production systems that are down. Support and developers are unable to help them due to the difficulty of debugging rare race conditions in production systems.
And all the time the bug was staring us in the face, easily reproducable in our own Buildbot! But we deliberately chose to ignore those failures. The only thing that matters is "when can you push", and getting more releases out. Never mind what kind of crap is then pushed out to users.
I am deeply ashamed of treating our users like this. And extremely unhappy. I even wrote about this (the importance of not ignoring Buildbot failures) just a few weeks ago. This was completely ignored, I got zero replies, and the weekly calls continued with mindless juggling of Jira numbers and buggy releases.
As a result, our users suffer.
We need to stop this.
Please?
- Kristian.
Regards, Sergey
Sergey Vojtovich <svoj@mariadb.org> writes:
    # First spin wait, hoping to get the mutex without yielding.
    os_compare_and_swap_ulint(&mutex->waiters, 0, 1);

Where did you see "os_compare_and_swap_ulint(&mutex->waiters, 0, 1)"? FWICS waiters of InnoDB mutex are non-atomic loads/stores.
mutex_enter_func() calls mutex_spin_wait() which does mutex_set_waiters(mutex, 1):

    mutex_set_waiters(mutex_t *mutex, ulint n)
    {
    #ifdef INNODB_RW_LOCKS_USE_ATOMICS
        if (n) {
            os_compare_and_swap_ulint(&mutex->waiters, 0, 1);
        } else {
            os_compare_and_swap_ulint(&mutex->waiters, 1, 0);
        }

So that is where I saw it. I do not understand what you mean with "waiters of InnoDB mutex are non-atomic loads/stores" ?
os_compare_and_swap_ulint() is defined as either __sync_bool_compare_and_swap() or atomic_cas_ulong(), both of which implies a full barrier. So ENTER looks ok. See above: waiters loads/stores are not atomic, so no memory barriers at all.
Well, the code is hopelessly complicated with various #ifdef cases. I do not claim to have verified all cases that they have the correct barrier for ENTER.

However, I found that in my particular build, mutex_set_waiters used the INNODB_RW_LOCKS_USE_ATOMICS case, which uses os_compare_and_swap_ulint(). And that is defined in os0sync.h as either __sync_bool_compare_and_swap(), which is documented as having a full barrier; or atomic_cas_ulong(), which I think is the MySQL's own atomics, which on x86 were implemented using LOCK prefix last time I looked, so also full barrier.

Why do you think there is no full barrier? Did I miss something?
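For reference, the shape of the mapping being described is roughly as follows. This is a simplified sketch based only on the description above, not a verbatim copy of os0sync.h (which has more #ifdef branches), and the guard macro name here is a placeholder:

    /* Simplified sketch: GCC builtin branch (documented as a full barrier). */
    #ifdef HAVE_GCC_ATOMIC_BUILTINS_SKETCH
    # define os_compare_and_swap(ptr, old_val, new_val) \
            __sync_bool_compare_and_swap(ptr, old_val, new_val)
    # define os_compare_and_swap_ulint(ptr, old_val, new_val) \
            os_compare_and_swap(ptr, old_val, new_val)
    #else
    /* Solaris-style atomics branch: atomic_cas_ulong() returns the old value;
       per the discussion above, the x86 implementation (LOCK-prefixed CAS)
       also acts as a full barrier. */
    # define os_compare_and_swap_ulint(ptr, old_val, new_val) \
            (atomic_cas_ulong(ptr, old_val, new_val) == (old_val))
    #endif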
This is consistent with the observations in MDEV-7026. A thread is waiting for wakeup on a mutex that is already marked free.

Ok, so you say this hang is now repeatable with x86 because we changed ACQUIRE -> RELEASE. I assume you were actually able to reproduce it on x86.
I have not tried reproducing on x86.

Note that on x86 we changed it from full barrier (XCHG instruction) to release (which does nothing if my test program is to be believed). The documentation only guarantees ACQUIRE for __sync_lock_test_and_set(), true, but in reality it seems to have been full barrier on x86, which would explain why it worked before.

I am speculating that some of the hangs that started to appear recently could be due to this change. At least one case, that I researched in considerable depth, can be explained adequately in this way. It seems likely that this is the explanation, but I cannot be sure.
Note that we also added mysterious os_isync to mutex_exit_func, which is supposed to be among other things ACQUIRE. But on x86 it is no-op with a comment: /* Performance regression was observed at some conditions for Intel architecture. Disable memory barrier for Intel architecture for now. */
I saw the added isync. I think this is also wrong. isync is about speculative execution, one can use it with a conditional instruction to prevent instructions being executed ahead of the completion of a load. But the place I saw, it was used incorrectly, without the conditional instruction. I did not dig deeper into this though, as even if used correctly it would still not be the full barrier that seems to be required here.
Does it explain server lockups at startup or during normal operations? Could you elaborate the problem you're talking about?
So my guess is that on x86, the full barriers are in place, and the mutex code works without losing wakeups. However, at some point there was some problem, maybe some specific compiler or architecture or whatever, and the sync_arr_wake_threads_if_sema_free() hack was implemented.

I am guessing that things work correctly on x86 because the sync_arr_wake_threads_if_sema_free() is fundamentally broken. It is not active during startup (and maybe shutdown?). And it does not work if the server error monitor thread happens to also get stuck on a mutex. So I think missing barriers on x86 would have been discovered in commonly used binaries. And the sync_arr_wake_threads_if_sema_free() has been able to hide problems in rare architectures. That is my guess.

As to the concrete problems:

1. MDEV-7026 there is a lockup during startup. I believe this is visible because the server error monitor thread is not running at this point.

2. In a user hang I investigated in detail, a thread was waiting on log_sys->mutex, but no other thread was holding that mutex (consistent with lost wakeup). The server error monitor thread was itself stuck on a mutex, which would prevent sync_arr_wake_threads_if_sema_free() from hiding the problem.
with Power8. I'd vote for putting full memory barrier into mutex_exit_func() and probably remove that ugly hack with sync_arr_wake_threads_if_sema_free. :(
Yes, I agree, that is what I meant. I agree that it was probably broken on PowerPC also before the change. But it looked to me like we had full barriers on x86 before the change (even if it was not guaranteed by documentation).
Though a proper solution should never issue full memory barriers for mutex lock/unlock. I suggested different idea, but that's for 10.1:
This sounds very strange. Are you sure? Normally, mutex lock/unlock is assumed to be a full barrier. If my reading of the code is correct, this is also how it was implemented on x86. It seems _very_ likely that part of the InnoDB code relies on such full barriers. It does not seem safe to me to change it.

Posix thread operations are for example defined to imply full barrier.

Or do you mean that there should be _one_ full barrier implied in mutex_enter(), and one in mutex_exit()? But there are currently more than one, and this should be optimised? That would make sense. Though my personal opinion is that the InnoDB mutex implementation has always been a piece of garbage, and rather than mess more with it, it should be replaced with something simpler and better (isn't this what the InnoDB team is already working on?).

 - Kristian.
Kristian, thanks for your answers! My comments inline. On Thu, Nov 06, 2014 at 06:48:18PM +0100, Kristian Nielsen wrote:
Sergey Vojtovich <svoj@mariadb.org> writes:
# First spin wait, hoping to get the mutex without yielding. os_compare_and_swap_ulint(&mutex->waiters, 0, 1); Where did you see "os_compare_and_swap_ulint(&mutex->waiters, 0, 1)"? FWICS waiters of InnoDB mutex are non-atomic loads/stores.
mutex_enter_func() calls mutex_spin_wait() which does mutex_set_waiters(mutex, 1):
    mutex_set_waiters(mutex_t *mutex, ulint n)
    {
    #ifdef INNODB_RW_LOCKS_USE_ATOMICS
        if (n) {
            os_compare_and_swap_ulint(&mutex->waiters, 0, 1);
        } else {
            os_compare_and_swap_ulint(&mutex->waiters, 1, 0);
        }
So that is where I saw it. I do not understand what you mean with "waiters of InnoDB mutex are non-atomic loads/stores" ?

Never seen this before. InnoDB of 5.5 and InnoDB/XtraDB of 10.0 both have:
    UNIV_INTERN
    void
    mutex_set_waiters(
    /*==============*/
        ib_mutex_t* mutex,  /*!< in: mutex */
        ulint       n)      /*!< in: value to set */
    {
        volatile ulint* ptr;    /* declared volatile to ensure that
                                the value is stored to memory */
        ut_ad(mutex);

        ptr = &(mutex->waiters);

        os_wmb;

        *ptr = n;       /* Here we assume that the write of a single
                        word in memory is atomic */
    }

But I confirm that I see this in XtraDB of 5.5.
os_compare_and_swap_ulint() is defined as either __sync_bool_compare_and_swap() or atomic_cas_ulong(), both of which implies a full barrier. So ENTER looks ok. See above: waiters loads/stores are not atomic, so no memory barriers at all.
Well, the code is hopelessly complicated with various #ifdef cases. I do not claim to have verified all cases that they have the correct barrier for ENTER.
However, I found that in my particular build, mutex_set_waiters used the INNODB_RW_LOCKS_USE_ATOMICS case, which uses os_compare_and_swap_ulint(). And that is defined in os0sync.h as either __sync_bool_compare_and_swap(), which is documented as having a full barrier; or atomic_cas_ulong(), which I think is the MySQL's own atomics, which on x86 were implemented using LOCK prefix last time I looked, so also full barrier.
Why do you think there is no full barrier? Did I miss something?
You're right. I didn't check XtraDB of 5.5, I did check InnoDB instead.
This is consistent with the observations in MDEV-7026. A thread is waiting for wakeup on a mutex that is already marked free. Ok, so you say this hang is now repeatable with x86 because we changed ACQUIRE -> RELEASE. I assume you were actually able to reproduce it on x86.
I have not tried reproducing on x86.
Note that on x86 we changed it from full barrier (XCHG instruction) to release (which does nothing if my test program is to be believed). The documentation only guarantees ACQUIRE for __sync_lock_test_and_set(), true, but in reality it seems to have been full barrier on x86, which would explain why it worked before.
I am speculating that some of the hangs that started to appear recently could be due to this change. At least one case, that I researched in considerable depth, can be explained adequately in this way. It seems likely that this is the explanation, but I cannot be sure.
Agree. Hangs can happen in the following situations:
- error monitor thread is not started (that's mostly for startup)
- error monitor thread is stuck itself (can happen any time)
Note that we also added mysterious os_isync to mutex_exit_func, which is supposed to be among other things ACQUIRE. But on x86 it is no-op with a comment: /* Performance regression was observed at some conditions for Intel architecture. Disable memory barrier for Intel architecture for now. */
I saw the added isync. I think this is also wrong. isync is about speculative execution, one can use it with a conditional instruction to prevent instructions being executed ahead of the completion of a load. But the place I saw, it was used incorrectly, without the conditional instruction. I did not dig deeper into this though, as even if used correctly it would still not be the full barrier that seems to be required here.
Agree. What we basically do is:

    lock_word= 0;   // atomic store with release barrier
    isync;
    if (waiters)    // non-atomic load

Here we need to ensure that the store completes before the load. But:
- there is no isync on x86
- isync doesn't guarantee StoreLoad order (there are many articles about this)

So a full memory barrier seems to be the only option.
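As a self-contained illustration of why release/acquire alone cannot rescue this pattern, here is the classic store-buffering test written with C11 atomics (a sketch for illustration, not MariaDB code). Here x stands in for lock_word and y for waiters: with release stores and acquire loads both threads are allowed to read 0, which is exactly the lost-wakeup outcome, while memory_order_seq_cst (a full barrier) forbids it.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    /* x plays the role of lock_word, y the role of waiters. */
    atomic_int x = 0, y = 0;
    int r1, r2;

    void *thr_release(void *arg)        /* models the releasing thread */
    {
        (void) arg;
        atomic_store_explicit(&x, 1, memory_order_release);
        /* With only release above and acquire below, this load may be
           satisfied before the store to x is visible to the other thread. */
        r1 = atomic_load_explicit(&y, memory_order_acquire);
        return NULL;
    }

    void *thr_enter(void *arg)          /* models the waiting thread */
    {
        (void) arg;
        atomic_store_explicit(&y, 1, memory_order_release);
        r2 = atomic_load_explicit(&x, memory_order_acquire);
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, thr_release, NULL);
        pthread_create(&b, NULL, thr_enter, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* r1 == 0 && r2 == 0 is permitted here; it corresponds to
           "skip wakeup" + "sleep forever".  Replacing the orderings with
           memory_order_seq_cst rules that outcome out.  (A single run may
           of course not show the reordering in practice.) */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }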
Does it explain server lockups at startup or during normal operations? Could you elaborate the problem you're talking about?
So my guess is that on x86, the full barriers are in place, and the mutex code works without losing wakeups. However, at some point there was some problem, maybe some specific compiler or architecture or whatever, and the sync_arr_wake_threads_if_sema_free() hack was implemented. I am guessing that things work correctly on x86 because the sync_arr_wake_threads_if_sema_free() is fundamentally broken. It is not active during startup (and maybe shutdown?). And it does not work if the server error monitor thread happens to also get stuck on a mutex. So I think missing barriers on x86 would have been discovered in commonly used binaries. And the sync_arr_wake_threads_if_sema_free() has been able to hide problems in rare architectures. That is my guess.
We're on the same wave.
As to the concrete problems:
1. MDEV-7026 there is a lockup during startup. I believe this is visible because the server error monitor thread is not running at this point.
Agree.
2. In a user hang I investigated in detail, a thread was waiting on log_sys->mutex, but no other thread was holding that mutex (consistent with lost wakeup). The server error monitor thread was itself stuck on a mutex, which would prevent sync_arr_wake_threads_if_sema_free() from hiding the problem.
Strange... Monty should have fixed this. Error monitor thread should call log_get_lsn_nowait(), which basically does trylock. Do you happen to have call trace?
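For context, the trylock approach being referred to amounts to something like the sketch below. It is reconstructed only from the description in this thread ("basically does trylock"), not copied from the source, and the exact name and return convention of mutex_enter_nowait() are assumptions:

    /* Sketch: sample the LSN only if log_sys->mutex can be taken without
       blocking, so the error monitor thread can never get stuck on it.
       Returns 0 when the mutex was busy and no LSN could be sampled. */
    lsn_t
    log_get_lsn_nowait(void)
    {
        lsn_t   lsn = 0;

        if (!mutex_enter_nowait(&log_sys->mutex)) {   /* assumed: 0 on success */
            lsn = log_sys->lsn;
            mutex_exit(&log_sys->mutex);
        }

        return(lsn);
    }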
with Power8. I'd vote for putting full memory barrier into mutex_exit_func() and probably remove that ugly hack with sync_arr_wake_threads_if_sema_free. :(
Yes, I agree, that is what I meant.
I agree that it was probably broken on PowerPC also before the change. But it looked to me like we had full barriers on x86 before the change (even if it was not guaranteed by documentation).
Correct. It was dead-broken on PPC and it had barrier on x86 before.
Though a proper solution should never issue full memory barriers for mutex lock/unlock. I suggested different idea, but that's for 10.1:
This sounds very strange. Are you sure? Normally, mutex lock/unlock is assumed to be a full barrier. If my reading of the code is correct, this is also how it was implemented on x86. It seems _very_ likely that part of the InnoDB code relies on such full barriers. It does not seem safe to me to change it.
Posix thread operations are for example defined to imply full barrier.
Or do you mean that there should be _one_ full barrier implied in mutex_enter(), and one in mutex_exit()? But there are currently more than one, and this should be optimised? That would make sense. Though my personal opinion is that the InnoDB mutex implementation has always been a piece of garbage, and rather than mess more with it, it should be replaced with something simpler and better (isn't this what the InnoDB team is already working on?).
Do you have a link to articles that define full memory barrier for mutex lock/unlock? Probably that was quite outdated writing. See e.g.:

    http://en.cppreference.com/w/cpp/atomic/memory_order

...Atomic load with memory_order_acquire or stronger is an acquire operation. The lock() operation on a Mutex is also an acquire operation...

...Atomic store with memory_order_release or stronger is a release operation. The unlock() operation on a Mutex is also a release operation...

Full memory barriers are way too heavy even for mutexes. All we need to do is to ensure that:
- all loads following lock are performed after lock (acquire)
- all stores preceding unlock are completed before unlock (release)

Regards,
Sergey
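To make the acquire/release point concrete, here is a minimal C11 spinlock sketch (an illustration only, unrelated to the InnoDB sources): acquire on lock and release on unlock are exactly what give a thread entering the critical section visibility of everything the previous holder wrote inside it, with no full barrier involved.

    #include <stdatomic.h>

    typedef struct { atomic_flag locked; } spinlock_t;
    #define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

    static inline void spin_lock(spinlock_t *l)
    {
        /* acquire: later loads/stores cannot move above this, and we see
           everything written before the previous unlock (release). */
        while (atomic_flag_test_and_set_explicit(&l->locked,
                                                 memory_order_acquire))
            ;   /* spin */
    }

    static inline void spin_unlock(spinlock_t *l)
    {
        /* release: earlier stores cannot move below this. */
        atomic_flag_clear_explicit(&l->locked, memory_order_release);
    }

Note that this only orders accesses relative to the lock word itself: as discussed further down in the thread, a store before spin_lock() and an unrelated load after spin_unlock() in the same thread may still be reordered, which is precisely what matters for the waiters handshake.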
Sergey Vojtovich <svoj@mariadb.org> writes:
Never seen this before. InnoDB of 5.5 and InnoDB/XtraDB of 10.0 both have:
    UNIV_INTERN
    void
    mutex_set_waiters(
    /*==============*/
        ib_mutex_t* mutex,  /*!< in: mutex */
        ulint       n)      /*!< in: value to set */
    {
        volatile ulint* ptr;    /* declared volatile to ensure that
                                the value is stored to memory */
        ut_ad(mutex);

        ptr = &(mutex->waiters);

        os_wmb;

        *ptr = n;       /* Here we assume that the write of a single
                        word in memory is atomic */
    }
But I confirm that I see this in XtraDB of 5.5.
Interesting. Clearly I got confused between 5.5/10.0 and the XtraDB/InnoDB distinction. There might be more to learn here.

In fact, I was looking mostly at the assembler output, to be sure I correctly understood all the #ifdef magic and actual instructions generated, and only referring back to source code for the description.

Looks like storage/innobase has a __sync_lock_test_and_set() (with a full barrier on x86) in the mutex_test_and_set() instead. But making sure that all the different combinations are correct looks like it could be a lot of fun...
Agree. What we basically do is:

    lock_word= 0;   // atomic store with release barrier
    isync;
    if (waiters)    // non-atomic load
But that's wrong, isn't it? Doesn't the isync need to go _after_ the conditional branch? isync discards any following speculated execution. So if you load a value and make a conditional jump that depends on the value loaded, then no instructions can start until after the load completed. Because the load must complete before the conditional can be decided. This effectively gives LoadLoad and LoadStore order. But isync _before_ a conditional - I am not sure that will do anything. But I guess, this was your point, as well ;-)
Here we need to ensure that store completes before load. But: - there is no isync on x86 - isync doesn't guarantee StoreLoad order (there are many articles about this)
Agree.
So full memory barrier seem to be the only option.
Yes...
2. In a user hang I investigated in detail, a thread was waiting on log_sys->mutex, but no other thread was holding that mutex (consistent with lost wakeup). The server error monitor thread was itself stuck on a mutex, which would prevent sync_arr_wake_threads_if_sema_free() from hiding the problem. Strange... Monty should have fixed this. Error monitor thread should call log_get_lsn_nowait(), which basically does trylock. Do you happen to have call trace?
This investigation was some time ago (maybe 1-2 months). It seems likely that this was a version before Monty's log_get_lsn_nowait() (eg. 10.0.14 has it, but not 10.0.13).
Do you have a link to articles that define full memory barrier for mutex
Hm, I don't really have a reference to point to. Well, I believe the posix thread definition says that a number of operations ensure full memory ordering across them.
lock/unlock? Probably that was quite outdated writing. See e.g.:
Outdated, perhaps, but aquire/release is AFAIK a relatively new concept. Mutexes, and InnoDB source code predates that? I'm not sure that C++ coming up with some strange new semantics for their language has much bearing on legacy code? But I agree it would be nice to have some references about "old-style" mutexes implying full memory barrier, so that it's not just me...
http://en.cppreference.com/w/cpp/atomic/memory_order
...Atomic load with memory_order_acquire or stronger is an acquire operation. The lock() operation on a Mutex is also an acquire operation...
...Atomic store with memory_order_release or stronger is a release operation. The unlock() operation on a Mutex is also a release operation...
Interesting... so C++ defines a "Mutex" with different semantics than what is usually understood with eg. pthread_mutex...
Full memory barriers are way too heavy even for mutexes. All we need to do is to ensure that:
- all loads following lock are performed after lock (acquire)
- all stores preceding unlock are completed before unlock (release)
Are you sure?

Note that if this is true, then it means that there is _no_ way to get a StoreLoad barrier using only normal mutex operations. That seems odd...

I know I have seen, and written, code that depends on lock/unlock being full barrier. How can we be sure that such code doesn't also exist in InnoDB?

Though I agree that full barrier is a lot more expensive than just LoadLoad or StoreStore, and best avoided if possible (I even blogged about that not so long ago).

 - Kristian.
Kristian, On Thu, Nov 06, 2014 at 09:23:42PM +0100, Kristian Nielsen wrote: ...skip...
Strange... Monty should have fixed this. Error monitor thread should call log_get_lsn_nowait(), which basically does trylock. Do you happen to have call trace?
This investigation was some time ago (maybe 1-2 months). It seems likely that this was a version before Monty's log_get_lsn_nowait() (eg. 10.0.14 has it, but not 10.0.13).

According to history ACQUIRE -> RELEASE fix appeared in 10.0.13 and fix for log_get_lsn() appeared in 10.0.14. Both fixes appeared simultaneously in 5.5.40.

So runtime hangs should be solved both in 5.5 and 10.0. This leaves hangs during startup, which are unfortunate but not as critical as runtime hangs. Are there any other known hangs?

lock/unlock? Probably that was quite outdated writing. See e.g.:

Outdated, perhaps, but aquire/release is AFAIK a relatively new concept. Mutexes, and InnoDB source code predates that? I'm not sure that C++ coming up with some strange new semantics for their language has much bearing on legacy code?

But I agree it would be nice to have some references about "old-style" mutexes implying full memory barrier, so that it's not just me...

Yes, acquire/release/etc is relatively new concept. For x86 this probably makes little sense. But at least on Power8:
- pthread_mutex_lock() issues "isync" (conforms to acquire semantics)
- pthread_mutex_unlock() issues "lwsync" (conforms to release semantics)
- sync builtins issue "sync" (conforms to seq_cst semantics)
http://en.cppreference.com/w/cpp/atomic/memory_order
...Atomic load with memory_order_acquire or stronger is an acquire operation. The lock() operation on a Mutex is also an acquire operation...
...Atomic store with memory_order_release or stronger is a release operation. The unlock() operation on a Mutex is also a release operation...
Interesting... so C++ defines a "Mutex" with different semantics than what is usually understood with eg. pthread_mutex...
Full memory barriers are way too heavy even for mutexes. All we need to do is to ensure that:
- all loads following lock are performed after lock (acquire)
- all stores preceding unlock are completed before unlock (release)
Are you sure?
Note that if this is true, then it means that there is _no_ way to get a StoreLoad barrier using only normal mutex operations. That seems odd...
I know I have seen, and written, code that depends on lock/unlock being full barrier. How can we be sure that such code doesn't also exist in InnoDB?
Though I agree that full barrier is a lot more expensive than just LoadLoad or StoreStore, and best avoided if possible (I even blogged about that not so long ago).
That's how I read it. So there is no guarantee that global_var1 will be stored before global_var2 is loaded:

    global_var1= 1;
    pthread_mutex_lock(&mutex);
    pthread_mutex_unlock(&mutex);
    local_var= global_var2;

Even more interesting: it has concept of "affected memory location" bound to memory barrier:

...
memory_order_acquire: A load operation with this memory order performs the acquire operation on the affected memory location: prior writes made to other memory locations by the thread that did the release become visible in this thread.

memory_order_release: A store operation with this memory order performs the release operation: prior writes to other memory locations become visible to the threads that do a consume or an acquire on the same location.
...

I read it as "release" on one memory location won't necessarily make stores visible to "acquire" on a different location.

Regards,
Sergey
Sergey Vojtovich <svoj@mariadb.org> writes:
According to history ACQUIRE -> RELEASE fix appeared in 10.0.13 and fix for log_get_lsn() appeared in 10.0.14. Both fixes appeared simultaneously in 5.5.40.
So runtime hangs should be solved both in 5.5 and 10.0. This leaves hangs during startup, which are unfortunate but not as critical as runtime hangs.
Are there any other known hangs?
If you are going to argue for deliberately keeping broken mutex_enter() / mutex_exit() in MariaDB, I will stay out of the discussion. I was only trying to get Buildbot green so that I can push my completely unrelated fixes for 10.0. - Kristian.
Kristian, On Fri, Nov 07, 2014 at 09:25:50AM +0100, Kristian Nielsen wrote:
Sergey Vojtovich <svoj@mariadb.org> writes:
According to history ACQUIRE -> RELEASE fix appeared in 10.0.13 and fix for log_get_lsn() appeared in 10.0.14. Both fixes appeared simultaneously in 5.5.40.
So runtime hangs should be solved both in 5.5 and 10.0. This leaves hangs during startup, which are unfortunate but not as critical as runtime hangs.
Are there any other known hangs?
If you are going to argue for deliberately keeping broken mutex_enter() / mutex_exit() in MariaDB, I will stay out of the discussion. I was only trying to get Buildbot green so that I can push my completely unrelated fixes for 10.0.

Heh, not really. I just want to understand how much things are broken and how to fix them in 5.5 and 10.0. For 10.1 I already suggested MDEV-6654, which I believe is the right thing to do.
Regards, Sergey
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
Hm, I don't really have a reference to point to. Well, I believe the posix thread definition says that a number of operations ensure full memory ordering across them.
I guess I was thinking of this:

    http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_...

It is somewhat vague: "The following functions synchronize memory with respect to other threads: ...".

I thought some more about this. Consider a simple example:

    THREAD A                  THREAD B
    lock(M)
    foo()
    unlock(M)
                              lock(M)
                              bar()
                              unlock(M)

Here, I would expect that bar() in B can see all changes made by foo() in A, because B locked the mutex after A released it. This requires that stores in foo() are completed before loads in bar() begin, effectively a full barrier.

However, this can be implemented in different ways. It doesn't have to be done with an explicit full barrier CPU instruction, as long as the correct semantics is preserved.

So let's consider unlock() with release semantics, and lock() with acquire semantics:

    unlock(M):
        __sync_lock_release(M->lock_word)

    lock(M):
        while (__sync_lock_test_and_set(M->lock_word, 1))
            /* wait for lock */ ;

The release barrier ensures that stores in foo() will complete before the store of lock_word. The acquire barrier in the wait loop ensures that loads in bar() cannot start until a load in the while loop has seen the store to lock_word.

Thus, loads in bar() become ordered after stores in foo(), effectively a full memory barrier even though no explicit full barrier instruction was needed.

So you may be right that the implementation needs no explicit full memory barrier in theory...

On the other hand, I have implemented code that _does_ assume a full barrier for each operation individually (see wait_for_commit::wakeup_subsequent_commits2() in 10.0 for an example). Others could have done the same. Reviewing all InnoDB code to make sure it is not the case, it does not really seem feasible...

The burden will be on the one who changes InnoDB locking primitives to prove that such change is safe with respect to the expectations of the code that uses those primitives. This is made all the more challenging by the semantics being somewhat unclear, and the main platform x86 mostly doing "full barrier" by default on atomic operations like test-and-set.

 - Kristian.
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
So you may be right that the implementation needs no explicit full memory barrier in theory...
On the other hand, I have implemented code that _does_ assume a full barrier for each operation individually (see
I checked glibc sources for powerpc. They seem to have only acquire barrier for pthread_mutex_lock(), and only release barrier for pthread_mutex_unlock(), assuming I read the code correctly. So looks like I am wrong to think that a full barrier is necessarily implied by pthread_mutex_lock() alone (or pthread_mutex_unlock() alone). I will need to review my earlier code to check for this. Well, usually in fact only the acquire / release is sufficient, but I'll need to check. Thanks, - Kristian.
Hi Kristian, On Fri, Nov 07, 2014 at 08:40:36AM +0100, Kristian Nielsen wrote:
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
Hm, I don't really have a reference to point to. Well, I believe the posix thread definition says that a number of operations ensure full memory ordering across them.
I guess I was thinking of this:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_...
It is somewhat wague: "The following functions synchronize memory with respect to other threads: ...".
I thought some more about this. Consider a simple example:
    THREAD A                  THREAD B
    lock(M)
    foo()
    unlock(M)
                              lock(M)
                              bar()
                              unlock(M)
Here, I would expect that bar() in B can see all changes made by foo() in A, because B locked the mutex after A released it. This requires that stores in foo() are completed before loads in bar() begin, effectively a full barrier.
However, this can be implemented in different ways. It doesn't have to be done with an explicit full barrier CPU instruction, as long as the correct semantics is preserved.
So let's consider unlock() with release semantics, and lock() with aquire semantics:
unlock(M): __sync_lock_release(M->lock_word)
lock(M): while (__sync_lock_test_and_set(M->lock_word, 1)) /* wait for lock */ ;
The release barrier ensures that stores in foo() will complete before the store of lock_word. The acquire barrier in the wait loop ensures that loads in bar() cannot start until a load in the while loop has seen the store to lock_word.
Thus, loads in bar() become ordered after stores in foo(), effectively a full memory barrier even though no explicit full barrier instruction was needed.
So you may be right that the implementation needs no explicit full memory barrier in theory...

Right. But I just realized that I'm a bit lost in terms now. When you say full memory barrier I assume memory_order_seq_cst. Was there any other type of barrier available before C11/C++11 acquire/release semantics were introduced?
On the other hand, I have implemented code that _does_ assume a full barrier for each operation individually (see wait_for_commit::wakeup_subsequent_commits2() in 10.0 for an example). Others could have done the same. Reviewing all InnoDB code to make sure it is not the case, it does not really seem feasible...
The burden will be on the one who changes InnoDB locking primitives to prove that such change is safe with respect to the expectations of the code that uses those primitives. This is made all the more challenging by the semantics being somewhat unclear, and the main platform x86 mostly doing "full barrier" by default on atomic operations like test-and-set.
*** I MAY BE VERY WRONG NOW, GUESS BASED STATEMENTS FOLLOW ***

If we abstract from platform specific implementation of barriers and think in logical acquire/release terms... In fact there is additional requirement for pthread_mutex_lock() which I omitted before: stores that appear after ACQUIRE must be actually performed after ACQUIRE. Manual says:

...
memory_order_acquire: A load operation with this memory order performs the acquire operation on the affected memory location: prior writes made to other memory locations by the thread that did the release become visible in this thread.
...

As you can see it only defines that memory LOADS become visible, but doesn't define anything wrt stores performed by current thread. This makes me think that stores are allowed to be postponed, but they are never executed in advance. If that's the case, your code is correct wrt memory barriers.

Regards,
Sergey
Did this ever get resolved? I enjoyed reading this, you did a nice job explaining the problem.

Linux had some fun in this area recently -- https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64

The InnoDB thread that checked waiting threads once per second for wakeup was a hack to fix a misunderstood bug. Fortunately the amazing Yasufumi Kinoshita found the bug -- https://bugs.mysql.com/bug.php?id=29560

On Thu, Nov 6, 2014 at 6:12 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
-- Mark Callaghan mdcallag@gmail.com
MARK CALLAGHAN <mdcallag@gmail.com> writes:
Did this ever get resolved? I enjoyed reading this, you did a nice job explaining the problem.
It was resolved for x86 in the sense that the code was reverted to the old behaviour. The code is semantically incorrect (uses the wrong barriers), but it happens to generate the correct assembler code on x86 due to its rather strong memory ordering. It is probably still incorrect for some other architectures (eg. multi-core ARM).
Linux had some fun in this area recently -- https://groups.google.com/forum/#!topic/mechanical-sympathy/QbmpZxp6C64
Right. This is very complex technology, very few developers have a full understanding of these issues. And the ones that do can still easily make mistakes.
The InnoDB thread that checked waiting threads once per second for wakeup was a hack to fix a misunderstood bug. Fortunately the amazing Yasufumi
Right. The InnoDB mutex implementation could benefit a lot from being rewritten, I think (and you have written about this as well). I think I heard it mentioned that MySQL@Oracle are actually working on this (but I do not know details).

I am glad you enjoyed reading my write-up.

 - Kristian.
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
MARK CALLAGHAN <mdcallag@gmail.com> writes:
Did this ever get resolved? I enjoyed reading this, you did a nice job explaining the problem.
It was resolved for x86 in the sense that the code was reverted to the old behaviour. The code is semantically incorrect (uses the wrong barriers), but it happens to generate the correct assembler code on x86 due to its rather strong memory ordering.
It is probably still incorrect for some other architectures (eg. multi-core ARM).
What about POWER? It's officially supported by Maria now :) -- Stewart Smith
Stewart Smith <stewart@flamingspork.com> writes:
It is probably still incorrect for some other architectures (eg. multi-core ARM).
What about POWER? It's officially supported by Maria now :)
There is special code for POWER (#ifdef), so it should work... - Kristian.
What about ARM which has a much weaker memory model compared to x86?

On Sat, May 23, 2015 at 12:46 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Stewart Smith <stewart@flamingspork.com> writes:
It is probably still incorrect for some other architectures (eg. multi-core ARM).
What about POWER? It's officially supported by Maria now :)
There is special code for POWER (#ifdef), so it should work...
- Kristian.
-- Mark Callaghan mdcallag@gmail.com
MARK CALLAGHAN <mdcallag@gmail.com> writes:
What about ARM which has a much weaker memory model compared to x86?
Yes. There is in fact a bug about it: https://mariadb.atlassian.net/browse/MDEV-7658 - Kristian.