Hello,
Recently, we experienced a sudden failure on a 3-node MariaDB Galera
cluster (version 10.0.13), and we can't find any documentation or
discussion of these particular log messages. Prior to the crash, the
cluster appeared to be experiencing connectivity issues, and this
particular node was partitioned from both of its peers. It was in a
non-primary state when the following occurred:
150126 6:26:16 [Note] WSREP:
evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, GATHER,
view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,4)) suspecting node:
b3156374-a4dc-11e4-90ef-7745fb12a381
150126 6:26:17 [Warning] WSREP: subsequent views have same members, prev
view view(view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,4) memb {
fee139da-a4dc-11e4-896b-13e5de7439e0,0
} joined {
} left {
} partitioned {
}) current view view(view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5)
memb {
fee139da-a4dc-11e4-896b-13e5de7439e0,0
} joined {
} left {
} partitioned {
})
150126 6:26:17 [Note] WSREP:
view(view_id(NON_PRIM,fee139da-a4dc-11e4-896b-13e5de7439e0,5) memb {
fee139da-a4dc-11e4-896b-13e5de7439e0,0
} joined {
} left {
} partitioned {
7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,0
b3156374-a4dc-11e4-90ef-7745fb12a381,0
})
150126 6:26:17 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no,
my_idx = 0, memb_num = 1
150126 6:26:17 [Note] WSREP: Flow-control interval: [16, 16]
150126 6:26:17 [Note] WSREP: Received NON-PRIMARY.
150126 6:26:17 [Note] WSREP: New cluster view: global state:
7ef7ed24-a4dc-11e4-be97-6ee0620c8266:43, view# -1: non-Primary, number of
nodes: 1, my index: 0, protocol version 3
150126 6:26:17 [Note] WSREP: wsrep_notify_cmd is not defined, skipping
notification.
150126 6:26:20 [Note] WSREP: (fee139da-a4dc-11e4-896b-13e5de7439e0, 'tcp://
0.0.0.0:4567') reconnecting to 7ef72f2a-a4dc-11e4-8f5b-bb400fba8578 (tcp://
10.85.49.128:4567), attempt 0
150126 6:26:28 [Note] WSREP: (fee139da-a4dc-11e4-896b-13e5de7439e0, 'tcp://
0.0.0.0:4567') address 'tcp://10.85.49.130:4567' pointing to uuid
fee139da-a4dc-11e4-896b-13e5de7439e0 is blacklisted, skipping
150126 6:26:28 [Note] WSREP: (fee139da-a4dc-11e4-896b-13e5de7439e0, 'tcp://
0.0.0.0:4567') address 'tcp://10.85.49.130:4567' pointing to uuid
fee139da-a4dc-11e4-896b-13e5de7439e0 is blacklisted, skipping
150126 6:26:28 [Note] WSREP: (fee139da-a4dc-11e4-896b-13e5de7439e0, 'tcp://
0.0.0.0:4567') address 'tcp://10.85.49.130:4567' pointing to uuid
fee139da-a4dc-11e4-896b-13e5de7439e0 is blacklisted, skipping
150126 6:26:28 [Note] WSREP: (fee139da-a4dc-11e4-896b-13e5de7439e0, 'tcp://
0.0.0.0:4567') address 'tcp://10.85.49.130:4567' pointing to uuid
fee139da-a4dc-11e4-896b-13e5de7439e0 is blacklisted, skipping
150126 6:26:28 [Note] WSREP: (fee139da-a4dc-11e4-896b-13e5de7439e0, 'tcp://
0.0.0.0:4567') turning message relay requesting off
150126 6:26:28 [Warning] WSREP:
evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, INSTALL,
view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5)) dropping foreign
message from b3156374-a4dc-11e4-90ef-7745fb12a381 in install state
150126 6:26:28 [Warning] WSREP:
evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, INSTALL,
view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5)) dropping foreign
message from b3156374-a4dc-11e4-90ef-7745fb12a381 in install state
150126 6:26:28 [Warning] WSREP:
evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, INSTALL,
view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5)) dropping foreign
message from b3156374-a4dc-11e4-90ef-7745fb12a381 in install state
150126 6:26:28 [Warning] WSREP:
evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, INSTALL,
view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5)) dropping foreign
message from b3156374-a4dc-11e4-90ef-7745fb12a381 in install state
150126 6:26:28 [Warning] WSREP:
evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, INSTALL,
view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5)) dropping foreign
message from b3156374-a4dc-11e4-90ef-7745fb12a381 in install state
150126 6:26:28 [Warning] WSREP:
evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, INSTALL,
view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5)) dropping foreign
message from b3156374-a4dc-11e4-90ef-7745fb12a381 in install state
150126 6:26:28 [ERROR] WSREP: exception caused by message:
evs::msg{version=0,type=3,user_type=255,order=1,seq=-1,seq_range=-1,aru_seq=-1,flags=4,source=7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,source_view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,6),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=176868,node_list=()
}
state after handling message:
evs::proto(evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, INSTALL,
view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,6)), INSTALL) {
current_view=view(view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,6) memb {
7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,0
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=-1,safe_seq=-1,node_index=},
fifo_seq=176418,
last_sent=11,
known={
7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,evs::node{operational=1,suspected=0,installed=1,fifo_seq=176868,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=4,seq_range=-1,aru_seq=4,flags=4,source=7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,source_view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,5),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=176864,node_list=(
7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,node:
{operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,5),safe_seq=4,im_range=[10,9],}
b3156374-a4dc-11e4-90ef-7745fb12a381,node:
{operational=0,suspected=1,leave_seq=-1,view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,5),safe_seq=4,im_range=[5,4],}
fee139da-a4dc-11e4-896b-13e5de7439e0,node:
{operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,4),safe_seq=0,im_range=[1,0],}
)
},
}
fee139da-a4dc-11e4-896b-13e5de7439e0,evs::node{operational=1,suspected=0,installed=1,fifo_seq=-1,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=11,seq_range=-1,aru_seq=11,flags=0,source=fee139da-a4dc-11e4-896b-13e5de7439e0,source_view_id=view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=176416,node_list=(
7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,node:
{operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,5),safe_seq=4,im_range=[10,9],}
fee139da-a4dc-11e4-896b-13e5de7439e0,node:
{operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5),safe_seq=11,im_range=[12,11],}
)
},
}
}
install
msg=evs::msg{version=0,type=5,user_type=255,order=1,seq=4,seq_range=-1,aru_seq=4,flags=4,source=7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,source_view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,5),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=176866,node_list=(
7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,node:
{operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,5),safe_seq=4,im_range=[10,9],}
b3156374-a4dc-11e4-90ef-7745fb12a381,node:
{operational=0,suspected=1,leave_seq=-1,view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,5),safe_seq=4,im_range=[5,4],}
fee139da-a4dc-11e4-896b-13e5de7439e0,node:
{operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5),safe_seq=11,im_range=[12,11],}
)
}
}
150126 6:26:28 [ERROR] WSREP: exception from gcomm, backend must be
restarted: nmi != known_.end(): node b3156374-a4dc-11e4-90ef-7745fb12a381
not found from known map (FATAL)
at gcomm/src/evs_proto.cpp:shift_to():2363
150126 6:26:28 [Note] WSREP: Received self-leave message.
150126 6:26:28 [Note] WSREP: Flow-control interval: [0, 0]
150126 6:26:28 [Note] WSREP: Received SELF-LEAVE. Closing connection.
150126 6:26:28 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 43)
150126 6:26:28 [Note] WSREP: RECV thread exiting 0: Success
150126 6:28:16 [Note] WSREP: New cluster view: global state:
7ef7ed24-a4dc-11e4-be97-6ee0620c8266:43, view# -1: non-Primary, number of
nodes: 0, my index: -1, protocol version 3
150126 6:28:16 [Note] WSREP: wsrep_notify_cmd is not defined, skipping
notification.
150126 6:28:16 [Note] WSREP: applier thread exiting (code:0)
At this point, the mysqld process was still running, but no further logging
occurred. We know that we can repair this cluster by bootstrapping a single
node, but we would like to understand what caused this crash. We've
highlighted the specific log messages that seemed most significant. The
logs of the peer node in question, b3156374-a4dc-11e4-90ef-7745fb12a381,
show that it continued to function (attempting to find cluster members) for
quite some time after this crash. It was unable to connect to
fee139da-a4dc-11e4-896b-13e5de7439e0 and is still running, still trying to
connect, at the moment.
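For reference, the repair we have in mind is the standard single-node
bootstrap. This is only a sketch, assuming mysqld is started directly; the
exact init-script or service integration varies by distribution:

```shell
# 1. Stop mysqld on all three nodes.

# 2. On the node with the most advanced state, start mysqld with the
#    bootstrap flag so it forms a new primary component on its own:
mysqld --wsrep-new-cluster &

# 3. Start mysqld normally on the other two nodes; they should rejoin
#    and resync (via IST or SST) from the bootstrapped node.

# 4. Confirm the cluster has reformed; we would expect
#    wsrep_cluster_status = Primary and wsrep_cluster_size = 3:
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_%';"
```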
Does anyone have insight into what's going on here, or has anyone seen
similar behavior? We're trying to determine whether this is an effect of
the connectivity issues we've seen, or something else entirely. We've also
cross-posted this on the Galera/Codership forums.
Thanks,
Raina and Lyle
Cloud Foundry Services Team