I have seen exactly the same issue again, second day in a row. This time
at 7:30 in the morning and again, no logs or anything that would give me
an idea of what could have caused this.
"killall -9 mysqld" and "service mysql start" on the hanging node is the
only solution I have so far...
I am crossposting this to maria-developers, maybe somebody on that list
has an idea where to start?
Jan
Am 31.05.2013 10:29, schrieb Jan Kirchhoff:
> Hi,
>
> we have been testing MariaDB 5.5.29 with Galera
> (wsrep_provider_version 23.2.4(r147)) for a few months now and have had
> a freeze of the cluster for the second time now. I didn't report the
> first hang about 6 weeks ago as I wanted to be sure it wasn't our fault,
> but I can't find any type of network issue or such in our monitoring
> system and nothing in the logs that could be related to this problem:
>
> We have 2 database servers running Ubuntu 12.10 with Kernel 3.5.0
> (names: db8, db9), with a bunch of 15k-disks, some SSDs, 128GB RAM and a
> test database of about 600-700GB. The amount of binlogs written every
> day is around 500GB, so you get an idea of the workload.
> db8 is a slave (traditional replication) of our main database. We have a
> garbd running on another server as a third galera instance.
>
> Everything runs smooth for some weeks, but since this morning at 3:50am
> db9 seems to be frozen. The server itself looks good, mysqld is taking
> about 50% of one CPU core, everything else looks normal. Nothing in the
> logs. Connecting to mysql works, but any query (like "show status" or
> normal selects) hang forever. Only "show engine innodb status" returns.
> There are almost 3000 processes that piled up like this:
>
> ---TRANSACTION 2F89D7CB, not started
> MySQL thread id 113575, OS thread handle 0x7fab31821700, query id
> 45249946 cl4.logical-line.de 192.168.0.16 stats_pflege sleeping
>
> but none of them ever gets any further. I can issue a "kill 113575" and
> the prompt returns, but nothing happens.
>
> This is what db8's global status says:
> wsrep_local_state_uuid 41d7259d-7200-11e2-0800-943dd0e2f659
> wsrep_protocol_version 4
> wsrep_last_committed 179518511
> wsrep_replicated 53590833
> wsrep_replicated_bytes 14768686288759
> wsrep_received 894995
> wsrep_received_bytes 7161712
> wsrep_local_commits 53583808
> wsrep_local_cert_failures 60
> wsrep_local_bf_aborts 0
> wsrep_local_replays 0
> wsrep_local_send_queue 8
> wsrep_local_send_queue_avg 0.000000
> wsrep_local_recv_queue 0
> wsrep_local_recv_queue_avg 0.000000
> wsrep_flow_control_paused 1.000000
> wsrep_flow_control_sent 0
> wsrep_flow_control_recv 0
> wsrep_cert_deps_distance 73.229730
> wsrep_apply_oooe 0.000000
> wsrep_apply_oool 0.000000
> wsrep_apply_window 0.000000
> wsrep_commit_oooe 0.000000
> wsrep_commit_oool 0.000000
> wsrep_commit_window 0.000000
> wsrep_local_state 4
> wsrep_local_state_comment Synced
> wsrep_cert_index_size 25606
> wsrep_causal_reads 0
> wsrep_incoming_addresses 192.168.10.30:3306,192.168.10.31:3306,
> wsrep_cluster_conf_id 15
> wsrep_cluster_size 3
> wsrep_cluster_state_uuid 41d7259d-7200-11e2-0800-943dd0e2f659
> wsrep_cluster_status Primary
> wsrep_connected ON
> wsrep_local_index 0
> wsrep_provider_name Galera
> wsrep_provider_vendor Codership Oy <info(a)codership.com>
> wsrep_provider_version 23.2.4(r147)
> wsrep_ready ON
>
>
>
>
> Here is the innodb status from db9 (the one that hangs). This is the
> only output I could get out of db9 at all:
>
> Type Name Status
> InnoDB
> =====================================
> 130531 10:02:08 INNODB MONITOR OUTPUT
> =====================================
> Per second averages calculated from the last 18 seconds
> -----------------
> BACKGROUND THREAD
> -----------------
> srv_master_thread loops: 907033 1_second, 906789 sleeps, 89831
> 10_second, 22491 background, 22489 flush
> srv_master_thread log flush and writes: 478222
> ----------
> SEMAPHORES
> ----------
> OS WAIT ARRAY INFO: reservation count 17017782, signal count 287266881
> Mutex spin waits 98936299, rounds 91970366, OS waits 868715
> RW-shared spins 139527705, rounds 455067309, OS waits 6488041
> RW-excl spins 58122849, rounds 842712424, OS waits 8565312
> Spin rounds per wait: 0.93 mutex, 3.26 RW-shared, 14.50 RW-excl
> --------
> FILE I/O
> --------
> I/O thread 0 state: waiting for completed aio requests (insert buffer
> thread)
> I/O thread 1 state: waiting for completed aio requests (log thread)
> I/O thread 2 state: waiting for completed aio requests (read thread)
> I/O thread 3 state: waiting for completed aio requests (read thread)
> I/O thread 4 state: waiting for completed aio requests (read thread)
> I/O thread 5 state: waiting for completed aio requests (read thread)
> I/O thread 6 state: waiting for completed aio requests (write thread)
> I/O thread 7 state: waiting for completed aio requests (write thread)
> I/O thread 8 state: waiting for completed aio requests (write thread)
> I/O thread 9 state: waiting for completed aio requests (write thread)
> Pending normal aio reads: 0 [0, 0, 0, 0] , aio writes: 0 [0, 0, 0, 0] ,
> ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
> Pending flushes (fsync) log: 0; buffer pool: 0
> 49405970 OS file reads, 277420410 OS file writes, 6653801 OS fsyncs
> 1 pending preads, 0 pending pwrites
> 49.44 reads/s, 16384 avg bytes/read, 0.61 writes/s, 0.61 fsyncs/s
> -------------------------------------
> INSERT BUFFER AND ADAPTIVE HASH INDEX
> -------------------------------------
> Ibuf: size 1, free list len 698, seg size 700, 300226 merges
> merged operations:
> insert 410879, delete mark 8032461, delete 82075
> discarded operations:
> insert 0, delete mark 0, delete 0
> Hash table size 212497717, node heap has 450852 buffer(s)
> 16846.79 hash searches/s, 77045.28 non-hash searches/s
> ---
> LOG
> ---
> Log sequence number 18522888544071
> Log flushed up to 18522888544071
> Last checkpoint at 18522888544051
> Max checkpoint age 2035833201
> Checkpoint age target 1972213414
> Modified age 20
> Checkpoint age 20
> 0 pending log writes, 0 pending chkp writes
> 6895012 log i/o's done, 0.39 log i/o's/second
> ----------------------
> BUFFER POOL AND MEMORY
> ----------------------
> Total memory allocated 110310195200; in additional pool allocated 0
> Internal hash tables (constant factor + variable factor)
> Adaptive hash index 9086745120 \t(1699981736 + 7386763384)
> Page hash 106249768 (buffer pool 0 only)
> Dictionary cache 426512815 \t(424997488 + 1515327)
> File system 429392 \t(82672 + 346720)
> Lock system 296805680 \t(295674296 + 1131384)
> Recovery system 0 \t(0 + 0)
> Dictionary memory allocated 1515327
> Buffer pool size 6553599
> Buffer pool size, bytes 107374166016
> Free buffers 0
> Database pages 6102747
> Old database pages 2252752
> Modified db pages 1
> Pending reads 1
> Pending writes: LRU 0, flush list 0, single page 0
> Pages made young 48135625, not young 0
> 64.83 youngs/s, 0.00 non-youngs/s
> Pages read 49405737, created 7213276, written 266046626
> 49.44 reads/s, 0.00 creates/s, 0.11 writes/s
> Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
> Pages read ahead 0.00/s, evicted without access 40.11/s, Random read
> ahead 0.00/s
> LRU len: 6102747, unzip_LRU len: 0
> I/O sum[2662]:cur[22], unzip sum[0]:cur[0]
> --------------
> ROW OPERATIONS
> --------------
> 1 queries inside InnoDB, 0 queries in queue
> 2 read views open inside InnoDB
> ---OLDEST VIEW---
> Normal read view
> Read view low limit trx n:o 2F754555
> Read view up limit trx id 2F754555
> Read view low limit trx id 2F754555
> Read view individually stored trx ids:
> -----------------
> Main thread process no. 15888, id 140381271201536, state: sleeping
> Number of rows inserted 46059494, updated 2501543763, deleted 28801859,
> read 17046037029
> 0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 48965.50 reads/s
> ------------
> TRANSACTIONS
> ------------
> Trx id counter 2F89DD34
> Purge done for trx's n:o < 2F754554 undo n:o < 0
> History list length 140
> LIST OF TRANSACTIONS FOR EACH SESSION:
> ---TRANSACTION 0, not started
> MySQL thread id 113586, OS thread handle 0x7fab314fe700, query id
> 45250432 localhost root sleeping
> SHOW engine INNODB STATUS
> ---TRANSACTION 2F89DC7B, not started
> MySQL thread id 113583, OS thread handle 0x7fab315d9700, query id
> 45250367 cl4.logical-line.de 192.168.0.16 stats_pflege sleeping
> ---TRANSACTION 2F89DB4F, not started
> MySQL thread id 113582, OS thread handle 0x7fab31622700, query id
> 45250261 cl4.logical-line.de 192.168.0.16 stats_pflege sleeping
> ---TRANSACTION 2F89DA23, not started
> MySQL thread id 113578, OS thread handle 0x7fab31746700, query id
> 45250159 cl4.logical-line.de 192.168.0.16 stats_pflege sleeping
> ---TRANSACTION 0, not started
> <many many rows like these, around 3000 although max_connections is 300?! >
>
> As soon as I trigger the restart of mysql on db9 (service mysql
> restart), I see normal log entries like this:
> May 31 10:16:59 db9 mysqld: 130531 10:16:59 [Note] WSREP: Stop replication
> May 31 10:16:59 db9 mysqld: 130531 10:16:59 [Note] WSREP: Closing send
> monitor...
> May 31 10:16:59 db9 mysqld: 130531 10:16:59 [Note] WSREP: Closed send
> monitor.
> May 31 10:16:59 db9 mysqld: 130531 10:16:59 [Note] WSREP: gcomm:
> terminating thread
>
> and db8 is accepting inserts again and catches up with replication.
>
> meanwhile, db9 says:
> May 31 10:17:02 db9 mysqld: 130531 10:17:02 [Note] WSREP: killing local
> connection: 113529
> May 31 10:17:02 db9 mysqld: 130531 10:17:02 [Note] WSREP: killing local
> connection: 113199
> May 31 10:17:02 db9 mysqld: 130531 10:17:02 [Note] WSREP: killing local
> connection: 110221
>
> and nothing happens any more. No CPU and no IO load coming from the
> mysqld processes.
> After 5 minutes I killed the mysqld with "killall -9 mysqld" and then
> started it again the normal way.
> db9 syncs via IST and everything is fine again. But I have no idea what
> caused all this. According to a colleague killing mysql on db9 was the
> thing that helped 6 weeks ago, too.
>
> Thanks for any pointers where we should look further.
>
> Jan Kirchhoff
>