Sergey Vojtovich <svoj@mariadb.org> writes:
Look at the cset comment: every mutex_exit() has to issue full memory barrier unconditionally!
Oh, you're right. I mixed up the code paths between mutex_exit() and the other side (in mutex_spin_wait()).
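To make the two code paths concrete, here is a minimal sketch of the pattern as I understand it. It is not the InnoDB code: `SketchMutex`, `sketch_try_enter()`, `sketch_prepare_wait()` and `sketch_exit()` are hypothetical names, and the event signalling is reduced to a counter. The point is only that the releasing side stores the lock word and then loads the waiters flag, and a store followed by a load of a different location may be reordered on x86/amd64 unless a full barrier sits in between.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical, heavily simplified mutex. Not the InnoDB implementation.
struct SketchMutex {
    std::atomic<int>  lock_word{0};     // 1 = held, 0 = free
    std::atomic<bool> waiters{false};   // set by a thread that is about to sleep
    std::atomic<int>  wakeups_sent{0};  // stand-in for signalling an OS event
};

// Try to take the mutex once; returns true on success.
inline bool sketch_try_enter(SketchMutex& m) {
    return m.lock_word.exchange(1, std::memory_order_acquire) == 0;
}

// Waiter side (the mutex_spin_wait() role): announce the intent to sleep,
// then re-check the lock word. Returns true if the mutex is still held,
// i.e. the caller may go to sleep and rely on being woken up.
inline bool sketch_prepare_wait(SketchMutex& m) {
    m.waiters.store(true, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // store, then load
    return m.lock_word.load(std::memory_order_relaxed) != 0;
}

// Releasing side (the mutex_exit() role).
inline void sketch_exit(SketchMutex& m) {
    assert(m.lock_word.load(std::memory_order_relaxed) == 1);  // caller holds it
    m.lock_word.store(0, std::memory_order_release);
    // Full barrier, unconditionally. A release store alone does not stop the
    // load of `waiters` below from being satisfied before the store above
    // becomes visible, in which case a sleeping waiter is never signalled.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    if (m.waiters.load(std::memory_order_relaxed)) {
        m.waiters.store(false, std::memory_order_relaxed);
        m.wakeups_sent.fetch_add(1, std::memory_order_relaxed);
    }
}
```

With a full fence on both sides, at least one of the two threads is guaranteed to observe the other's store: either the waiter sees the lock word cleared and does not sleep, or the releaser sees the waiters flag and sends the wakeup.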
me> Strange... Monty should have fixed this. Error monitor thread should call log_get_lsn_nowait(), which basically does trylock. Do you happen to have call trace?
me> According to the history, the ACQUIRE -> RELEASE fix appeared in 10.0.13 and the fix for log_get_lsn() appeared in 10.0.14. Both fixes appeared simultaneously in 5.5.40.
Stating that this patch fixes run-time hangs that I'm not aware of is kind of strange.
So I repeat my question: Are there any other known hangs?
Well, there are no hangs that we "know" are caused by this bug. There are hangs that we suspect could be caused by this bug. Monty's patch, as you say, is not in 10.0.13. It is also insufficient: Jan apparently changed another mutex lock to be a trylock, and that change is not in 10.0.14 or 5.5.40, IIUC. There might be other ways for the error monitor thread to get stuck, and even if not, there can still be a server stall for 1 second. I think you already know all of this, so I'm not sure what answer you are looking for from me, sorry...
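For anyone following along, the trylock idea mentioned above can be sketched roughly as below. This is only an illustration using `std::mutex`: `SketchLog`, `sketch_get_lsn()` and `sketch_get_lsn_nowait()` are hypothetical stand-ins, not the actual log_get_lsn() / log_get_lsn_nowait() code.

```cpp
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for the log system state. Not the InnoDB structures.
struct SketchLog {
    std::mutex    mutex;
    std::uint64_t lsn = 0;
};

// Blocking variant: waits for the mutex, so the caller can get stuck
// behind whoever holds it.
inline std::uint64_t sketch_get_lsn(SketchLog& log) {
    std::lock_guard<std::mutex> guard(log.mutex);
    return log.lsn;
}

// Non-blocking variant: gives up immediately if the mutex is busy.
// Returns false without touching `out` in that case.
inline bool sketch_get_lsn_nowait(SketchLog& log, std::uint64_t& out) {
    std::unique_lock<std::mutex> guard(log.mutex, std::try_to_lock);
    if (!guard.owns_lock()) {
        return false;  // caller skips this round and retries later
    }
    out = log.lsn;
    return true;
}
```

A monitoring thread that uses the non-blocking variant cannot itself block on this particular mutex, but it can still block on any other mutex it takes the ordinary way, which is why one trylock conversion may not be enough.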
Could you suggest better wording for cset comment?
Here is a suggestion:

"In MariaDB 5.5.40 and 10.0.13, the InnoDB/XtraDB low-level mutex implementation was inadvertently broken, so that a waiter may miss the wakeup when another thread releases the mutex. This affects at least x86 and amd64 architectures. This could result in threads occasionally stalling for about 1 second, or in some cases even hanging the whole server indefinitely."

Hope this helps,

 - Kristian.