Re: [Maria-discuss] Semaphore hangs

15 Dec 2016

      ...
On 12/08/2016 03:13 PM, Daniel Black wrote:
...
On 09/12/16 09:43, Jon Foster wrote:
...
On 12/07/2016 06:04 PM, Daniel Black wrote:
...
On 08/12/16 08:51, Jon Foster wrote:
...
We are having trouble with MariaDB hanging due to a "semaphore wait". We
then have to shut MariaDB down as it typically won't recover, unless it
restarts itself, which happens if we wait long enough. But if it's gone
on long enough MariaDB won't even shut down; it hangs indefinitely
waiting for some other internal service. I don't remember the exact name
and we've been fast enough I haven't seen it in a while.
We've had the database on two completely different servers and still see
the problem. Both servers were bought new for this project and are a
year or less old. They are running all SSD drives, Debian 7 64bit with
MariaDB 10.1 from the MariaDB APT repository.
Since the XtraDB engine was usually mentioned in the logged messages we
switched back to the Oracle InnoDB engine. Although this seems to have
reduced the frequency it didn't fix it.
Can anyone give some advice on fixing this? It really seems like a bug
in MariaDB. I'll try to provide any needed info.
On 12/08/2016 04:16 PM, Jon Foster wrote:
[...]
So it's happened again on Tuesday (12/13) morning, early enough that the east 
coasters got it before I was aware of it (they are 3hrs ahead and I was 
just getting up). Unfortunately I wasn't able to try the "gdb" request from
the previous discussion on this topic. So I've been looking for ways to 
cross-reference all the threads and mutexes mentioned to try and pinpoint 
where the failure is happening.

This crash produced over 430MB of log data. I sliced out the first InnoDB 
monitor dump (a mere 1.5MB) and stripped it down to just the threads and 
related messages. I'm still reviewing the logs but I found something I 
thought was interesting enough I'd throw it out here and see if anyone had 
any thoughts.

There were 4,894 threads listed in the dump. But it appears that everyone 
was waiting for one thread. Here is what the log said about that one thread:

06:01:42 --Thread 139879467059968 has waited at trx0sys.ic line 431 for 0.00 seconds the semaphore:
06:01:42 Mutex at 0x7f3a09a92068 created file trx0sys.cc line 729, lock var 0
06:01:42 Last time reserved by thread 18446744073709551615 in file not yet reserved line 0, waiters flag 0

I trimmed out the date and server name to shorten the lines. Several 
interesting things to note:

1. Thread 18446744073709551615 doesn't exist in the InnoDB monitor dump.
2. All of the other thread IDs are 15 digits. This one is 20 digits.
3. Over a thousand other threads are waiting on this one because it 
apparently has the lock_sys->mutex mutex. All of the remaining threads are 
waiting on those others.
4. This thread shows a 0 second wait time when many of the other threads 
say they've been waiting over 250 seconds.
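
The cross-referencing behind these observations can be sketched in a few 
lines of Python. This is only a rough sketch, assuming every mutex wait in 
the monitor output follows the three-part layout of the excerpt above (a 
"--Thread ... has waited at" line, a "Mutex at" line and a "Last time 
reserved by thread" line). The names parse_waits, group_by_holder and 
unlisted_holders are made up for illustration, and waits printed in any 
other layout (rw-locks, for instance) are simply skipped.

```python
import re
from collections import defaultdict

# The one entry quoted above, used here as sample input.
SAMPLE = """\
06:01:42 --Thread 139879467059968 has waited at trx0sys.ic line 431 for 0.00 seconds the semaphore:
06:01:42 Mutex at 0x7f3a09a92068 created file trx0sys.cc line 729, lock var 0
06:01:42 Last time reserved by thread 18446744073709551615 in file not yet reserved line 0, waiters flag 0
"""

# \s+ between words lets the patterns survive lines that were re-wrapped.
HEAD_RE = re.compile(
    r"\s*(\d+)\s+has\s+waited\s+at\s+(\S+)\s+line\s+(\d+)"
    r"\s+for\s+([\d.]+)\s+seconds"
)
MUTEX_RE = re.compile(r"Mutex\s+at\s+(0x[0-9a-fA-F]+)")
HOLDER_RE = re.compile(r"reserved\s+by\s+thread\s+(\d+)")


def parse_waits(text):
    """Return one dict per mutex wait entry found in the monitor text."""
    waits = []
    # Split so that each chunk contains at most one "--Thread" entry.
    for chunk in text.split("--Thread")[1:]:
        head = HEAD_RE.match(chunk)
        mutex = MUTEX_RE.search(chunk)
        holder = HOLDER_RE.search(chunk)
        if not (head and mutex and holder):
            continue  # not a mutex wait in the assumed layout
        waits.append({
            "waiter": int(head.group(1)),
            "where": f"{head.group(2)}:{head.group(3)}",
            "seconds": float(head.group(4)),
            "mutex": mutex.group(1),
            "holder": int(holder.group(1)),
        })
    return waits


def group_by_holder(waits):
    """Map each (mutex, last holder) pair to the threads waiting on it."""
    groups = defaultdict(list)
    for w in waits:
        groups[(w["mutex"], w["holder"])].append(w["waiter"])
    return dict(groups)


def unlisted_holders(waits):
    """Holders that never show up as a waiting thread themselves."""
    waiters = {w["waiter"] for w in waits}
    return {w["holder"] for w in waits} - waiters


if __name__ == "__main__":
    waits = parse_waits(SAMPLE)
    for (mutex, holder), waiters in group_by_holder(waits).items():
        print(f"mutex {mutex} last reserved by {holder}: {len(waiters)} waiter(s)")
    print("holders not among the waiters:", unlisted_holders(waits))
```

Run against the full dump instead of SAMPLE, the grouping shows which 
mutexes have the most waiters, and unlisted_holders picks out "last 
reserved by" IDs that are not waiting on anything themselves, which is 
where a root blocker (or a bogus ID like the one above) would surface.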

Sure looks like the mutex is being held by a non-existent thread. Memory 
corruption?
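
One check worth making before settling on corruption: 
18446744073709551615 is exactly 2^64 - 1, the largest unsigned 64-bit 
value, which is the same bit pattern as -1. The small Python sketch below 
just verifies that arithmetic. The helper as_signed64 is a made-up name. 
Reading the value as a "no thread" placeholder rather than a real thread ID 
is an assumption I have not confirmed against the InnoDB source, but it 
would fit with the "not yet reserved" file name and "lock var 0" in the 
same entry.

```python
# The thread ID reported as the last holder of the mutex.
THREAD_ID = 18446744073709551615

# Largest value an unsigned 64-bit integer can hold (all 64 bits set).
UINT64_MAX = 2**64 - 1


def as_signed64(value):
    """Reinterpret an unsigned 64-bit value as a signed 64-bit integer."""
    return value - 2**64 if value >= 2**63 else value


print(THREAD_ID == UINT64_MAX)   # True
print(hex(THREAD_ID))            # 0xffffffffffffffff
print(as_signed64(THREAD_ID))    # -1
```

If that reading is right, the entry would describe a mutex that nobody has 
reserved, so the 0.00 second wait on it would be unremarkable and the real 
blocker would have to be found elsewhere in the dump.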

I'm still looking over the logs so I might find some other stuff or 
something else to point the finger at. But I thought I'd throw this out 
there and see if anyone has some insight. Or maybe I should be taking this 
issue to another list or report it as a bug?

THX - Jon

-- 
Sent from my Debian Linux workstation -- http://www.debian.org/intro/about

Jon Foster
JF Possibilities, Inc.
jon@jfpossibilities.com
541-410-2760
Making computers work for you!