Anyone?

On Thu, Jun 20, 2019 at 5:59 PM Tiemen Ruiten <t.ruiten@tech-lab.io> wrote:
Hello,

I'm trying to do a rolling upgrade of a MariaDB 10.3 Galera cluster to to 10.4, following the instructions here. (I had to adjust the wsrep_provider variable for Galera 4 to /usr/lib64/galera-4/libgalera_smm.so, not mentioned in the instructions). After I start the mariadb service, one mysqld thread uses 100% of one core for about 45 minutes, then I see the following in the log and mariadb restarts, repeating the same process:

Jun 20 16:24:44 rubidium mysqld: 2019-06-20 16:24:44 0 [Note] WSREP: Trying to continue unpaused monitor
Jun 20 16:24:47 rubidium mysqld: 2019-06-20 16:24:47 3 [Note] WSREP: REPL Protocols: 9 (4, 2)
Jun 20 16:24:47 rubidium mysqld: 2019-06-20 16:24:47 3 [Note] WSREP: ####### My UUID: ddb4f4da-935f-11e9-bcff-8209857266f0
Jun 20 16:24:47 rubidium mysqld: 2019-06-20 16:24:47 3 [Note] WSREP: Server mariadb4 connected to cluster at position 6a1fd0aa-4739-11e5-af88-42756e861245:7403466644 with ID ddb4f4da-935f-11e9-bcff-8209857266f0
Jun 20 16:24:47 rubidium mysqld: 2019-06-20 16:24:47 3 [Note] WSREP: Server status change disconnected -> connected
Jun 20 16:24:47 rubidium mysqld: 2019-06-20 16:24:47 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
Jun 20 16:24:47 rubidium mysqld: 2019-06-20 16:24:47 3 [Note] WSREP: State transfer required:
Jun 20 16:24:47 rubidium mysqld: Group state: 6a1fd0aa-4739-11e5-af88-42756e861245:7403466644
Jun 20 16:24:47 rubidium mysqld: Local state: 6a1fd0aa-4739-11e5-af88-42756e861245:7403368279
Jun 20 16:24:47 rubidium mysqld: 2019-06-20 16:24:47 3 [Note] WSREP: Server status change connected -> joiner
Jun 20 16:24:47 rubidium mysqld: 2019-06-20 16:24:47 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
Jun 20 16:24:47 rubidium mysqld: 2019-06-20 16:24:47 0 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '10.100.130.24' --datadir '/var/lib/mysql/'   --parent '21283'  ''  '''
Jun 20 16:24:48 rubidium rsyncd[26235]: rsyncd version 3.1.2 starting, listening on port 4444
Jun 20 16:24:48 rubidium mysqld: 2019-06-20 16:24:48 3 [Note] WSREP: Prepared SST request: rsync|10.100.130.24:4444/rsync_sst
Jun 20 16:24:48 rubidium mysqld: 2019-06-20 16:24:48 3 [Note] WSREP: Cert index reset to 6a1fd0aa-4739-11e5-af88-42756e861245:7403466644 (proto: 9), state transfer needed: yes
Jun 20 16:24:48 rubidium mysqld: 2019-06-20 16:24:48 0 [Note] WSREP: Service thread queue flushed.
Jun 20 16:24:48 rubidium mysqld: 2019-06-20 16:24:48 3 [Note] WSREP: ####### Assign initial position for certification: 6a1fd0aa-4739-11e5-af88-42756e861245:7403466644, protocol version: 4
Jun 20 16:24:48 rubidium mysqld: 2019-06-20 16:24:48 3 [Note] WSREP: ####### IST uuid:6a1fd0aa-4739-11e5-af88-42756e861245 f: 7403368280, l: 7403466644, STRv: 2
Jun 20 16:24:48 rubidium mysqld: 2019-06-20 16:24:48 3 [Note] WSREP: IST receiver addr using tcp://10.100.130.24:4568
Jun 20 16:24:48 rubidium mysqld: 2019-06-20 16:24:48 3 [Note] WSREP: Prepared IST receiver for 7403368280-7403466644, listening at: tcp://10.100.130.24:4568
Jun 20 16:24:48 rubidium mysqld: 2019-06-20 16:24:48 0 [Note] WSREP: Member 2.0 (mariadb4) requested state transfer from 'mariadb2'. Selected 1.0 (mariadb2)(SYNCED) as donor.
Jun 20 16:24:48 rubidium mysqld: 2019-06-20 16:24:48 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 7403946566)
Jun 20 16:24:48 rubidium mysqld: 2019-06-20 16:24:48 3 [ERROR] WSREP: Slave queue grew too long while trying to request state transfer 1 time(s). Please make sure that there is at least one fully synced member in the group. Application must be restarted.
Jun 20 16:24:48 rubidium mysqld: 2019-06-20 16:24:48 3 [ERROR] WSREP: State transfer request failed unrecoverably: 35 (Resource deadlock avoided). Most likely it is due to inability to communicate with the cluster primary component. Restart required.

On the donor, I see:

2019-06-20 16:24:44 0 [Note] WSREP: Trying to continue unpaused monitor
2019-06-20 16:24:48 0 [Note] WSREP: Member 2.0 (mariadb4) requested state transfer from 'mariadb2'. Selected 1.0 (mariadb2)(SYNCED) as donor.
2019-06-20 16:24:48 0 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 7403946566)
2019-06-20 16:24:48 18 [Note] WSREP: IST request: 6a1fd0aa-4739-11e5-af88-42756e861245:7403368279-7403466644|tcp://10.100.130.24:4568
2019-06-20 16:24:48 18 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2019-06-20 16:24:48 0 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'donor' --address '10.100.130.24:4444/rsync_sst' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/'     '' --gtid '6a1fd0aa-4739-11e5-af88-42756e861245:7403368279' --gtid-domain-id '0' --bypass'
2019-06-20 16:24:48 18 [Note] WSREP: sst_donor_thread signaled with 0
2019-06-20 16:24:48 0 [Note] WSREP: async IST sender starting to serve tcp://10.100.130.24:4568 sending 7403368280-7403466644
2019-06-20 16:24:48 0 [Note] WSREP: (700b363f, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.100.130.24:4567
2019-06-20 16:24:48 0 [Note] WSREP: declaring 116c0b4c at tcp://10.100.130.22:4567 stable
2019-06-20 16:24:48 0 [Note] WSREP: forgetting ddb4f4da (tcp://10.100.130.24:4567)
2019-06-20 16:24:48 0 [Note] WSREP: (700b363f, 'tcp://0.0.0.0:4567') turning message relay requesting off
2019-06-20 16:24:48 0 [Note] WSREP: Node 116c0b4c state prim
2019-06-20 16:24:48 0 [Note] WSREP: view(view_id(PRIM,116c0b4c,297) memb {
        116c0b4c,0
        700b363f,0
} joined {
} left {
} partitioned {
        ddb4f4da,0
})
2019-06-20 16:24:48 0 [Note] WSREP: save pc into disk
2019-06-20 16:24:48 0 [Note] WSREP: forgetting ddb4f4da (tcp://10.100.130.24:4567)
2019-06-20 16:24:48 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2019-06-20 16:24:48 0 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2019-06-20 16:24:48 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 286f5ce4-9367-11e9-acee-4e4dec61d7c1
2019-06-20 16:24:48 0 [Note] WSREP: STATE EXCHANGE: got state msg: 286f5ce4-9367-11e9-acee-4e4dec61d7c1 from 0 (mariadb1)
2019-06-20 16:24:48 0 [Note] WSREP: STATE EXCHANGE: got state msg: 286f5ce4-9367-11e9-acee-4e4dec61d7c1 from 1 (mariadb2)
2019-06-20 16:24:48 0 [Note] WSREP: Quorum results:
        version    = 4,
        component  = PRIMARY,
        conf_id    = 269,
        members    = 2/2 (joined/total),
        act_id     = 7403946566,
        last_appl. = 7403946444,
        protocols  = 0/9/3 (gcs/repl/appl),
        group UUID = 6a1fd0aa-4739-11e5-af88-42756e861245
2019-06-20 16:24:48 0 [Note] WSREP: Flow-control interval: [23, 23]
2019-06-20 16:24:48 0 [Note] WSREP: Trying to continue unpaused monitor
2019-06-20 16:24:48 14 [Note] WSREP: New cluster view: global state: 6a1fd0aa-4739-11e5-af88-42756e861245:7403946566, view# 270: Primary, number of nodes: 2, my index: 1, protocol version 3
2019-06-20 16:24:48 14 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2019-06-20 16:24:48 14 [Note] WSREP: REPL Protocols: 9 (4, 2)
2019-06-20 16:24:48 14 [Note] WSREP: Assign initial position for certification: 7403946566, protocol version: 4
2019-06-20 16:24:48 0 [Note] WSREP: Service thread queue flushed.
2019-06-20 16:24:48 0 [ERROR] WSREP: async IST sender failed to serve tcp://10.100.130.24:4568: ist send failed: asio.system:104', asio error 'read: Connection reset by peer': 104 (Connection reset by peer)
         at galera/src/ist.cpp:send():790
2019-06-20 16:24:48 0 [Note] WSREP: async IST sender served
WSREP_SST: [INFO] Bypassing state dump. (20190620 16:24:48.149)
rsync: safe_read failed to read 1 bytes [sender]: Connection reset by peer (104)
rsync error: error in rsync protocol data stream (code 12) at io.c(276) [sender=3.1.2]
2019-06-20 16:24:48 0 [ERROR] WSREP: Failed to read from: wsrep_sst_rsync --role 'donor' --address '10.100.130.24:4444/rsync_sst' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/'     '' --gtid '6a1fd0aa-4739-11e5-af88-42756e861245:7403368279' --gtid-domain-id '0' --bypass
2019-06-20 16:24:48 0 [ERROR] WSREP: Process completed with error: wsrep_sst_rsync --role 'donor' --address '10.100.130.24:4444/rsync_sst' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/'     '' --gtid '6a1fd0aa-4739-11e5-af88-42756e861245:7403368279' --gtid-domain-id '0' --bypass: 12 (Cannot allocate memory)
2019-06-20 16:24:48 0 [ERROR] WSREP: Command did not run: wsrep_sst_rsync --role 'donor' --address '10.100.130.24:4444/rsync_sst' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/'     '' --gtid '6a1fd0aa-4739-11e5-af88-42756e861245:7403368279' --gtid-domain-id '0' --bypass
2019-06-20 16:24:48 0 [Warning] WSREP: Could not find peer: ddb4f4da-935f-11e9-bcff-8209857266f0
2019-06-20 16:24:48 0 [Warning] WSREP: 1.0 (mariadb2): State transfer to -1.-1 (left the group) failed: -12 (Cannot allocate memory)

What is going on?