Re: [Maria-discuss] Can't connect to local MySQL server through socket
Hi
I'll try hard to fix the problem with rsync, even if I switch to mysqldump or xtrabackup-v2 in the future.
Some context: MariaDB and Galera are built with Buildroot and run on Raspberry Pis (ARMv6). MariaDB is started manually. I fixed the config file as you suggested yesterday (wsrep_cluster_address=gcomm://IP1,IP2,IP3 in my.cnf on all three nodes). I bootstrap the cluster on node #1 with --wsrep-new-cluster, then simply start mysqld_safe on nodes #2 and #3.
Running:

    mysql -u root --execute="SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_ready', 'wsrep_cluster_size', 'wsrep_cluster_status', 'wsrep_connected');"

on node #1 gives me:

    +----------------------+---------+
    | Variable_name        | Value   |
    +----------------------+---------+
    | wsrep_cluster_size   | 3       |
    | wsrep_cluster_status | Primary |
    | wsrep_connected      | ON      |
    | wsrep_ready          | ON      |
    +----------------------+---------+

and on nodes #2 and #3:

    ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/tmp/mysql.sock' (111 "Connection refused")
If I start mariadb outside a cluster on nodes #2 and #3 it works as expected.
It seems nodes #2 and #3 are never synced, since SYNCED never appears in the error log.
wsrep_sst_method is rsync.
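A minimal sketch of that log check in shell, for anyone following along. The helper name node_synced is made up for illustration, and the log path is an assumption taken from the --log-error option mysqld is started with on these nodes; adjust it to your setup. It only looks for the word SYNCED in the log, which is the same informal signal described above, not an official health check.

```shell
# Hypothetical helper: succeeds if the given error log is readable
# and contains the word SYNCED at least once.
node_synced() {
    log_file=$1
    [ -r "$log_file" ] && grep -q 'SYNCED' "$log_file"
}

# Assumed path: the --log-error value used on the nodes in this thread.
if node_synced /var/lib/mysql/buildroot.err; then
    echo "node reached SYNCED"
else
    echo "node not synced (or log not readable)"
fi
```

A node that is still waiting on its state transfer would take the second branch, which matches the "Connection refused" symptom on nodes #2 and #3.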
On 24/06/2015 15:00, Guillaume Lefranc wrote:
Does this mean that mariadb tries to sync nodes but for some reason the script hangs, the nodes are never sync and remain unusable?
I'm afraid that must be something along those lines. I have the following suspicions:
* SELinux or AppArmor not disabled, causing the SST to hang forever.
* Ports not open on the firewall (for the record you need 3306, but also 4444, 4567 and 4568).
Neither SELinux nor AppArmor is enabled, and there is no firewall on any of the boxes. They are all connected to the same Ethernet switch.
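To rule out the port question quickly, something along these lines can be run from one node against another. This is only a sketch: port_open is a hypothetical helper, it assumes bash (for /dev/tcp) and the coreutils timeout command are installed, and the default peer address is a placeholder. As far as I know, 4444 (rsync SST) and 4568 (IST) only accept connections while a state transfer is in progress, so a refusal there is not necessarily a firewall problem.

```shell
# Hypothetical helper: succeeds if a TCP connection to HOST:PORT can be opened.
# Assumes bash (for /dev/tcp redirection) and coreutils `timeout` are installed.
port_open() {
    timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# PEER is a placeholder; pass another node's address as the first argument.
PEER=${1:-127.0.0.1}

# Ports named in the thread: 3306, 4444, 4567 and 4568.
for port in 3306 4444 4567 4568; do
    if port_open "$PEER" "$port"; then
        echo "$PEER:$port reachable"
    else
        echo "$PEER:$port NOT reachable"
    fi
done
```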
I'm investigating the wsrep_sst_rsync script. Here are the processes running:

    # ps auxf | grep mysql
      128 root  {mysqld_safe} /bin/sh /usr/bin/mysqld_safe --defaults-file=/etc/mysql/my.cnf
      422 mysql /usr/bin/mysqld --defaults-file=/etc/mysql/my.cnf --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/plugin --user=mysql --wsrep_provider=/usr/lib/libgalera_smm.so --log-error=/var/lib/mysql/buildroot.err --pid-file=/tmp/mysql.pid --socket=/tmp/mysql.sock --port=3306 --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1
      429 mysql sh -c wsrep_sst_rsync --role 'joiner' --address '192.168.1.80' --auth '' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --parent '422' --binlog '/var/lib/mysql/mariadb-bin'
      430 mysql {wsrep_sst_rsync} /bin/bash -ue /usr//bin/wsrep_sst_rsync --role joiner --address 192.168.1.80 --auth --datadir /var/lib/mysql/ --defaults-file /etc/mysql/my.cnf --parent 422 --binlog /var/lib/mysql/mariadb-bin
      458 mysql rsync --daemon --no-detach --port 4444 --config /var/lib/mysql//rsync_sst.conf
    26966 mysql sleep 0.2
Note the sleep 0.2: it shows up often. It appears in wsrep_sst_rsync in two places:
lines 136--140:

    # wait for tables flushed and state ID written to the file
    while [ ! -r "$FLUSHED" ] && ! grep -q ':' "$FLUSHED" >/dev/null 2>&1
    do
        sleep 0.2
    done

lines 281--284:

    until check_pid_and_port $RSYNC_PID $RSYNC_REAL_PID $RSYNC_PORT
    do
        sleep 0.2
    done
Is it possible that one of these loops never terminates?
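Neither loop has a timeout, so a condition that never becomes true keeps the script spinning. Purely as an illustration (this is not how wsrep_sst_rsync is written), a bounded version of such a wait could look like the sketch below. wait_until and the marker path are hypothetical; it sleeps in whole seconds because POSIX only specifies integer arguments to sleep.

```shell
# Hypothetical helper: run CMD once per second until it succeeds or
# TIMEOUT seconds have elapsed. Returns 0 on success, 1 on timeout.
wait_until() {
    timeout_s=$1
    shift
    elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Example: give up after 2 seconds if the (made-up) marker file never appears.
if wait_until 2 test -r /tmp/hypothetical_flush_marker; then
    echo "condition met"
else
    echo "timed out instead of looping forever"
fi
```

With a bound like this, a hang shows up as an explicit timeout rather than a joiner that silently never becomes ready.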
If I manually run the same commands that mysqld invokes, they terminate with errors because SST_PROGRESS_FILE and WSREP_SST_OPT_ROLE aren't defined, but this is probably not significant.
I'll be happy to read any ideas you might have. Maybe I should also send this message to the Galera mailing list?
Cheers,
-- Sylvain Raybaud www.green-communications.fr
Sylvain,
I don't know much about what Buildroot does, so I can't tell whether you're hitting a limitation it might have.
Just a suggestion, you can try adding "set -x" to /usr/bin/wsrep_sst_rsync, so the script will dump its output in the log. You should be able to know where it hangs precisely then.
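For readers who haven't used it: set -x makes the shell print each command, after expansion, to stderr before running it. A toy sketch of the effect (check_port_demo is a made-up stand-in, not part of wsrep_sst_rsync):

```shell
# Toy stand-in for one step of an SST script; not the real wsrep_sst_rsync.
check_port_demo() {
    (
        set -x    # from here on, each command is echoed to stderr
        rsync_port=4444
        echo "would check port $rsync_port"
    )
}

# Capture only the trace (stderr); the function's normal output is discarded.
trace=$(check_port_demo 2>&1 >/dev/null)
printf '%s\n' "$trace"
```

The trace lines start with one or more + characters, so the last command shown before a hang is usually the place to look.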
You can also try to run the SST command manually on the nodes, and see what it does. You can get the full command output in ps so you're free to start a donor on one node and a joiner on another node and follow the script output.
Best,
2015-06-25 12:27 GMT+02:00 Sylvain Raybaud < sylvain.raybaud@green-communications.fr>:
_______________________________________________ Mailing list: https://launchpad.net/~maria-discuss Post to : maria-discuss@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-discuss More help : https://help.launchpad.net/ListHelp
Guillaume,
On 25/06/2015 13:40, Guillaume Lefranc wrote:
Just a suggestion, you can try adding "set -x" to /usr/bin/wsrep_sst_rsync, so the script will dump its output in the log. You should be able to know where it hangs precisely then.
It gives me the following sequence repeating forever:

    + check_pid_and_port /var/lib/mysql//rsync_sst.pid 20189 4444
    + local pid_file=/var/lib/mysql//rsync_sst.pid
    + local rsync_pid=20189
    + local rsync_port=4444
    + which lsof
    ++ lsof -i :4444 -Pn
    ++ grep '(LISTEN)'
    + local port_info=
    ++ echo
    ++ grep -w '^rsync[[:space:]]+20189'
    + local is_rsync=
    + '[' -n '' -a -z '' ']'

It seems to correspond to lines 281--284:

    until check_pid_and_port $RSYNC_PID $RSYNC_REAL_PID $RSYNC_PORT
    do
        sleep 0.2
    done
check_pid_and_port seems to be checking that rsync is running and listening, basically. The strange thing is that lsof -i :4444 -Pn doesn't return anything, although ps shows that rsync was invoked correctly:

    rsync --daemon --no-detach --port 4444 --config /var/lib/mysql//rsync_sst.conf

Actually, lsof -i :4444 -Pn seems to behave rather differently on my laptop (Ubuntu) and on buildroot. Indeed, lsof in buildroot is provided by busybox by default, and busybox tools sometimes differ significantly from the regular ones. I'm going to rebuild my system with the real lsof package and see if it gets better. I'll let you know.
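The check in that trace boils down to one question: does lsof report an rsync process with the expected PID listening on the port? A rough sketch of the same idea, reading lsof-style output from stdin so it can be tried on any captured output. rsync_listening is a hypothetical helper, not the script's actual check_pid_and_port, and the sample line assumes the column layout printed by the full lsof package (COMMAND, PID, ... NAME).

```shell
# Hypothetical helper: read `lsof -i :PORT -Pn` style output on stdin and
# succeed only if a process named rsync with the given PID is listening on PORT.
rsync_listening() {
    awk -v pid="$1" -v port="$2" '
        $1 == "rsync" && $2 == pid && $NF == "(LISTEN)" && $(NF-1) ~ (":" port "$") { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Sample input in the assumed full-lsof layout; the values are illustrative.
sample='COMMAND   PID  USER FD TYPE DEVICE SIZE/OFF NODE NAME
rsync   20189 mysql 4u IPv4  12345      0t0  TCP *:4444 (LISTEN)'

if printf '%s\n' "$sample" | rsync_listening 20189 4444; then
    echo "rsync is listening"
fi

# Empty output, as in the trace above, can never satisfy the check.
if ! printf '' | rsync_listening 20189 4444; then
    echo "no matching line: the wait loop keeps going"
fi
```

Any lsof variant that prints nothing, or prints a different layout, leaves a check like this permanently false, which is consistent with the endless loop in the trace.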
You can also try to run the SST command manually on the nodes, and see what it does. You can get the full command output in ps so you're free to start a donor on one node and a joiner on another node and follow the script output.
I did, and it fails because some variables are unbound. I think this is specific to manual invocation.
-- Sylvain Raybaud www.green-communications.fr
Well, if you had said that your shell was busybox in the first place, that would have saved us a lot of time... :-)
2015-06-25 14:41 GMT+02:00 Sylvain Raybaud < sylvain.raybaud@green-communications.fr>:
On 25/06/2015 14:42, Guillaume Lefranc wrote:
Well, if you had said that your shell was busybox in the first place, that would have saved us a lot of time... :-)
It's not my shell: my shell is regular bash. But many of the other tools are busybox, yes.
Connecting to the mysql daemon still fails with the same error, but at least lsof is now working as expected. That's progress :)
-- Sylvain Raybaud www.green-communications.fr
Hi all
Finally I managed to get everything up and running by using regular (i.e. non-busybox) versions of:
* sleep
* ps
* xargs
Hope this helps.
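A quick way to see which of these tools are busybox applets on a given system is to follow the symlink behind each command. A sketch, assuming readlink -f is available and that applets are installed as symlinks to a binary named busybox (the usual layout, but not guaranteed); is_busybox_applet is a made-up name.

```shell
# Hypothetical helper: succeeds if NAME resolves, via PATH and symlinks,
# to a binary called "busybox". Assumes `readlink -f` is available.
is_busybox_applet() {
    path=$(command -v "$1") || return 1
    target=$(readlink -f "$path") || return 1
    [ "$(basename "$target")" = "busybox" ]
}

# The tools replaced in this thread, plus lsof from the earlier message.
for tool in sleep ps xargs lsof; do
    if is_busybox_applet "$tool"; then
        echo "$tool: busybox applet"
    else
        echo "$tool: not busybox (or not installed)"
    fi
done
```

Shell builtins and statically merged applets would not be detected this way, so treat a "not busybox" answer as a hint rather than proof.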
Next steps are, not necessarily in this order:
* propose patches for optional components that do not build in buildroot
* create a buildroot package for xtrabackup
* package and use the galera arbitrator
* create scripts for automating cluster initialisation and node joining (the idea is to have a mariadb galera cluster working on a very unstable network)
I'll let you know.
Cheers,
Sylvain
On 25/06/2015 15:17, Sylvain Raybaud wrote:
-- Sylvain Raybaud www.green-communications.fr
participants (2)
- Guillaume Lefranc
- Sylvain Raybaud