Hello,

Recently, we experienced a sudden failure on a 3-node MariaDB Galera cluster (version 10.0.13), and can't find any documentation or discussion of these particular logs. Prior to the crash, our cluster seemed to be experiencing connectivity issues, and this particular node was partitioned from both its peers. It was in a non-primary state when the following occurred:

 150126  6:26:16 [Note] WSREP: evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, GATHER, view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,4)) suspecting node: b3156374-a4dc-11e4-90ef-7745fb12a381
150126  6:26:17 [Warning] WSREP: subsequent views have same members, prev view view(view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,4) memb {
        fee139da-a4dc-11e4-896b-13e5de7439e0,0
} joined {
} left {
} partitioned {
}) current view view(view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5) memb {
        fee139da-a4dc-11e4-896b-13e5de7439e0,0
} joined {
} left {
} partitioned {
})
150126  6:26:17 [Note] WSREP: view(view_id(NON_PRIM,fee139da-a4dc-11e4-896b-13e5de7439e0,5) memb {
        fee139da-a4dc-11e4-896b-13e5de7439e0,0
} joined {
} left {
} partitioned {
        7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,0
        b3156374-a4dc-11e4-90ef-7745fb12a381,0
})
150126  6:26:17 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
150126  6:26:17 [Note] WSREP: Flow-control interval: [16, 16]
150126  6:26:17 [Note] WSREP: Received NON-PRIMARY.
150126  6:26:17 [Note] WSREP: New cluster view: global state: 7ef7ed24-a4dc-11e4-be97-6ee0620c8266:43, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
150126  6:26:17 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
150126  6:26:20 [Note] WSREP: (fee139da-a4dc-11e4-896b-13e5de7439e0, 'tcp://0.0.0.0:4567') reconnecting to 7ef72f2a-a4dc-11e4-8f5b-bb400fba8578 (tcp://10.85.49.128:4567), attempt 0
150126  6:26:28 [Note] WSREP: (fee139da-a4dc-11e4-896b-13e5de7439e0, 'tcp://0.0.0.0:4567') address 'tcp://10.85.49.130:4567' pointing to uuid fee139da-a4dc-11e4-896b-13e5de7439e0 is blacklisted, skipping
150126  6:26:28 [Note] WSREP: (fee139da-a4dc-11e4-896b-13e5de7439e0, 'tcp://0.0.0.0:4567') address 'tcp://10.85.49.130:4567' pointing to uuid fee139da-a4dc-11e4-896b-13e5de7439e0 is blacklisted, skipping
150126  6:26:28 [Note] WSREP: (fee139da-a4dc-11e4-896b-13e5de7439e0, 'tcp://0.0.0.0:4567') address 'tcp://10.85.49.130:4567' pointing to uuid fee139da-a4dc-11e4-896b-13e5de7439e0 is blacklisted, skipping
150126  6:26:28 [Note] WSREP: (fee139da-a4dc-11e4-896b-13e5de7439e0, 'tcp://0.0.0.0:4567') address 'tcp://10.85.49.130:4567' pointing to uuid fee139da-a4dc-11e4-896b-13e5de7439e0 is blacklisted, skipping
150126  6:26:28 [Note] WSREP: (fee139da-a4dc-11e4-896b-13e5de7439e0, 'tcp://0.0.0.0:4567') turning message relay requesting off
150126  6:26:28 [Warning] WSREP: evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, INSTALL, view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5)) dropping foreign message from b3156374-a4dc-11e4-90ef-7745fb12a381 in install state
150126  6:26:28 [Warning] WSREP: evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, INSTALL, view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5)) dropping foreign message from b3156374-a4dc-11e4-90ef-7745fb12a381 in install state
150126  6:26:28 [Warning] WSREP: evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, INSTALL, view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5)) dropping foreign message from b3156374-a4dc-11e4-90ef-7745fb12a381 in install state
150126  6:26:28 [Warning] WSREP: evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, INSTALL, view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5)) dropping foreign message from b3156374-a4dc-11e4-90ef-7745fb12a381 in install state
150126  6:26:28 [Warning] WSREP: evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, INSTALL, view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5)) dropping foreign message from b3156374-a4dc-11e4-90ef-7745fb12a381 in install state
150126  6:26:28 [Warning] WSREP: evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, INSTALL, view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5)) dropping foreign message from b3156374-a4dc-11e4-90ef-7745fb12a381 in install state
150126  6:26:28 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=3,user_type=255,order=1,seq=-1,seq_range=-1,aru_seq=-1,flags=4,source=7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,source_view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,6),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=176868,node_list=()
}
 state after handling message: evs::proto(evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, INSTALL, view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,6)), INSTALL) {
current_view=view(view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,6) memb {
        7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,0
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=-1,safe_seq=-1,node_index=},
fifo_seq=176418,
last_sent=11,
known={
        7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,evs::node{operational=1,suspected=0,installed=1,fifo_seq=176868,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=4,seq_range=-1,aru_seq=4,flags=4,source=7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,source_view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,5),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=176864,node_list=(   7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,5),safe_seq=4,im_range=[10,9],}
        b3156374-a4dc-11e4-90ef-7745fb12a381,node: {operational=0,suspected=1,leave_seq=-1,view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,5),safe_seq=4,im_range=[5,4],}
        fee139da-a4dc-11e4-896b-13e5de7439e0,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,4),safe_seq=0,im_range=[1,0],}
)
},
}
        fee139da-a4dc-11e4-896b-13e5de7439e0,evs::node{operational=1,suspected=0,installed=1,fifo_seq=-1,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=11,seq_range=-1,aru_seq=11,flags=0,source=fee139da-a4dc-11e4-896b-13e5de7439e0,source_view_id=view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=176416,node_list=( 7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,5),safe_seq=4,im_range=[10,9],}
        fee139da-a4dc-11e4-896b-13e5de7439e0,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5),safe_seq=11,im_range=[12,11],}
)
},
}
 }
install msg=evs::msg{version=0,type=5,user_type=255,order=1,seq=4,seq_range=-1,aru_seq=4,flags=4,source=7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,source_view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,5),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=176866,node_list=(       7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,5),safe_seq=4,im_range=[10,9],}
        b3156374-a4dc-11e4-90ef-7745fb12a381,node: {operational=0,suspected=1,leave_seq=-1,view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,5),safe_seq=4,im_range=[5,4],}
        fee139da-a4dc-11e4-896b-13e5de7439e0,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5),safe_seq=11,im_range=[12,11],}
)
}
 }
 
150126  6:26:28 [ERROR] WSREP: exception from gcomm, backend must be restarted: nmi != known_.end(): node b3156374-a4dc-11e4-90ef-7745fb12a381 not found from known map (FATAL)
         at gcomm/src/evs_proto.cpp:shift_to():2363
150126  6:26:28 [Note] WSREP: Received self-leave message.
150126  6:26:28 [Note] WSREP: Flow-control interval: [0, 0]
150126  6:26:28 [Note] WSREP: Received SELF-LEAVE. Closing connection.
150126  6:26:28 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 43)
150126  6:26:28 [Note] WSREP: RECV thread exiting 0: Success
150126  6:28:16 [Note] WSREP: New cluster view: global state: 7ef7ed24-a4dc-11e4-be97-6ee0620c8266:43, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 3
150126  6:28:16 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
150126  6:28:16 [Note] WSREP: applier thread exiting (code:0)

At this point, the mysqld process was still running but no further logging occurred. We know that we can repair this cluster by bootstrapping a single node, but would like to understand what may have happened to cause this crash. We've highlighted some specific log messages that seemed more significant. The logs of the node in question, b3156374-a4dc-11e4-90ef-7745fb12a381, show that the node continued to function (attempting to find cluster members) for quite some time after this crash. It was unable to connect to fee139da-a4dc-11e4-896b-13e5de7439e0and is still running, still trying to connect at the moment. 

Any insight into what's going on here, or has anyone seen similar behavior? Looking to see whether this is an effect of the connectivity issues we've seen, or something else entirely. We've also cross-posted this on the Galera/Codership forums.

Thanks,

Raina and Lyle
Cloud Foundry Services Team