[Maria-discuss] Backup on the replication server getting affected
Hi team,

I am facing a replication issue in my DB setup: we have a master-slave pair with replication running between the servers.

*Environment:* MariaDB 10.6.11
*DB size:* approx. 1 TB

While taking a mariabackup, at the prepare stage of the backup I see an interruption in replication, which is visible in the MySQL logs. The backup itself was successful, but replication does not catch up with the master and the slave workers appear to be stuck forever (as per the processlist). Even stopping the slave SQL thread, or stopping the slave entirely, does not fix the issue. (Without the backup process, replication works fine without any delays.)

*Error Log:*

[Note] Error reading relay log event: slave SQL thread was killed
[Note] Slave SQL thread exiting, replication stopped in log 'binary-log.015277' at position 266164018; GTID position '0-2-439338736', master: 172.16.117.178:3307
[Note] Slave SQL thread initialized, starting replication in log 'binary-log.015277' at position 266164018, relay log './mysql-1-relay-bin.004330' position: 94753264; GTID position '0-2-439338736'

*ProcessList:*

show processlist; \! date
+--------+-------------+----------------------+------+--------------+--------+-----------------------------------------------+-----------------------+----------+
| Id     | User        | Host                 | db   | Command      | Time   | State                                         | Info                  | Progress |
+--------+-------------+----------------------+------+--------------+--------+-----------------------------------------------+-----------------------+----------+
| 8      | system user |                      | NULL | Slave_IO     | 600078 | Waiting for master to send event              | NULL                  | 0.000    |
| 704611 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704614 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704612 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704613 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704615 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704617 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704616 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704619 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704618 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704627 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704620 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704621 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704622 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704623 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704625 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704629 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704626 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704624 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704632 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704628 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704630 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704636 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704631 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704634 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704637 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704633 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704635 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704643 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704638 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704639 | system user |                      | NULL | Slave_worker | 113000 | closing tables                                | NULL                  | 0.000    |
| 704641 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704642 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704651 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704652 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704645 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704654 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704648 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704646 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704649 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704656 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704650 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704644 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704657 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704653 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704640 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704655 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704647 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704658 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704610 | system user |                      | NULL | Slave_SQL    | 113039 | Waiting for room in worker thread event queue | NULL                  | 0.000    |
| 733190 | root        | 172.16.117.210:49448 | NULL | Query        | 46259  | Killing slave                                 | STOP SLAVE SQL_THREAD | 0.000    |
| 733618 | root        | 172.16.117.210:55994 | naf  | Query        | 0      | starting                                      | show processlist      | 0.000    |
+--------+-------------+----------------------+------+--------------+--------+-----------------------------------------------+-----------------------+----------+
52 rows in set (0.000 sec)

Wed Apr 5 14:10:00 CEST 2023

Replication as a Backup Solution - MariaDB Knowledge Base
<https://mariadb.com/kb/en/replication-as-a-backup-solution/>

As per the page: "Running the backup from a slave has the advantage of being able to shutdown or lock the slave and perform a backup without any impact on the primary server."

*Does it mean running a backup on a slave might impact replication? If this is expected behavior, do we have a proper way to take a backup on a daily basis on the replication server?*

*Or is the backup not the problem, and some other factor (like a deadlock) is causing the slave threads to lock?*

Regards,
Ragul R
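For context, a minimal sketch of the kind of daily backup run described above; the paths and credentials are hypothetical placeholders, not the actual invocation used on this replica:

    # Take a physical backup on the replica (placeholder paths/credentials).
    mariabackup --backup \
        --target-dir=/backups/$(date +%F) \
        --user=backup_user --password=backup_pass

    # The "prepare" stage, during which the replication stall above was observed.
    mariabackup --prepare --target-dir=/backups/$(date +%F)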
Howdy Ragul,
Hi team,
I am facing a replication issue in my DB setup: we have a master-slave pair with replication running between the servers.
Environment: MariaDB 10.6.11 DB size: approx. 1TB
While taking a mariabackup, at the prepare stage of the backup I see an interruption in replication, which is visible in the MySQL logs. The backup itself was successful, but replication does not catch up with the master and the slave workers appear to be stuck forever (as per the processlist). Even stopping the slave SQL thread, or stopping the slave entirely, does not fix the issue. (Without the backup process, replication works fine without any delays.)
From the show processlist output I would suspect MDEV-30780, "optimistic parallel slave hangs after hit an error".

If you can reproduce it, could you please file a Jira ticket (https://jira.mariadb.org/) and include:

1. mysqlbinlog output of the replication events being executed by all workers at that time (you may need the master binlog for that; find the last executed position through SHOW SLAVE STATUS or @@global.gtid_slave_pos).
2. The slave error log.
3. On the slave, the output of:

   gdb -ex 'set height 0' -ex 'thread apply all backtrace' -p <mariadb-pid>
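For reference, a rough sketch of collecting the items requested above, assuming shell access to both servers; the binlog name and position are taken from the error log earlier in the thread, and "pidof mariadbd" is just one way to find the server pid:

    -- On the slave: find the last executed position and GTID (item 1 input).
    SHOW SLAVE STATUS\G
    SELECT @@global.gtid_slave_pos;

    # On the master: decode the events around that position (item 1).
    mysqlbinlog --start-position=266164018 binary-log.015277 > events.txt

    # On the slave: capture backtraces of all threads (item 3).
    gdb -ex 'set height 0' -ex 'thread apply all backtrace' -p "$(pidof mariadbd)"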
| 704638 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
| 704639 | system user |                      | NULL | Slave_worker | 113000 | closing tables                                | NULL                  | 0.000    |
| 704641 | system user |                      | NULL | Slave_worker | 113000 | Waiting for prior transaction to commit       | NULL                  | 0.000    |
Replication as a Backup Solution - MariaDB Knowledge Base

As per the page: "Running the backup from a slave has the advantage of being able to shutdown or lock the slave and perform a backup without any impact on the primary server."

Does it mean running a backup on a slave might impact replication?
Not really. The sentence merely says the master server's performance won't be affected when one takes a backup on the slave.
If this is expected behavior, do we have a proper way to take a backup on a daily basis on the replication server?
Or is the backup not the problem, and some other factor (like a deadlock) is causing the slave threads to lock?
I would think of a deadlock.

Cheers,

Andrei
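For reference, a minimal sketch of checking on the replica whether a lock wait or deadlock is what is holding up the SQL/worker threads, using standard InnoDB views:

    -- The latest detected deadlock and current lock waits are reported here.
    SHOW ENGINE INNODB STATUS\G

    -- Long-running or blocked transactions on the replica.
    SELECT trx_id, trx_state, trx_started, trx_query
      FROM information_schema.INNODB_TRX
     ORDER BY trx_started;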
Thanks Andrei,

I believe my issue is related to MDEV-30780, "optimistic parallel slave hangs after hit an error". I am trying to reproduce it with a minimal database.

Attaching the gdb output.

Regards,
Ragul R
ragul rangarajan <ragulrangarajan@gmail.com> writes:
I believe my issue is related to MDEV-30780, "optimistic parallel slave hangs after hit an error". I am trying to reproduce it with a minimal database.
Attaching the gdb output.
Thanks, that gdb output is really helpful!

I agree with Andrei that this rules out MDEV-30780 as the cause. Instead it looks to be caused by MDEV-29843, see also MDEV-31427:

https://jira.mariadb.org/browse/MDEV-29843
https://jira.mariadb.org/browse/MDEV-31427

This is seen in the stack trace, where all the other worker threads are waiting on one which is stuck inside pthread_cond_signal:

-----------------------------------------------------------------------
Thread 80 (Thread 0x7f47ad065700 (LWP 25417)):
#0  0x00007f789dca054d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f789dc9e14d in pthread_cond_signal@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#2  0x000055de401c23cd in inline_mysql_cond_signal (that=0x7f4798006b78) at /home/buildbot/buildbot/build/include/mysql/psi/mysql_thread.h:1099
#3  dec_pending_ops (state=<synthetic pointer>, this=0x7f4798006b30) at /home/buildbot/buildbot/build/sql/sql_class.h:2535
#4  thd_decrement_pending_ops (thd=0x7f47980009b8) at /home/buildbot/buildbot/build/sql/sql_class.cc:5142
#5  0x000055de407b5726 in group_commit_lock::release (this=this@entry=0x55de41f0da80 <write_lock>, num=num@entry=216757233923465) at /home/buildbot/buildbot/build/storage/innobase/log/log0sync.cc:388
#6  0x000055de407a0a3c in log_write_up_to (lsn=<optimized out>, lsn@entry=216757233923297, flush_to_disk=flush_to_disk@entry=false, rotate_key=rotate_key@entry=false, callback=<optimized out>, callback@entry=0x7f47ad064090) at /home/buildbot/buildbot/build/storage/innobase/log/log0log.cc:844
-----------------------------------------------------------------------

The pthread_cond_signal() function normally can never block, so this indicates some corruption of the underlying condition object. This object is used to asynchronously complete a query on a client connection when using the thread pool. The MDEV-29843 patch makes worker threads not use this asynchronous completion, which should eliminate this problem.

The stack trace strongly indicates MDEV-29843 as the cause. Except that the MDEV-29843 patch is supposed to be in MariaDB 10.6.11, and you wrote:
Environment: MariaDB 10.6.11
Can you double-check whether you are really seeing this hang in 10.6.11, or whether it could have been 10.6.10 (the only version that is supposed to be vulnerable to MDEV-29843)?

Another thing you can check is whether you are using --thread-handling=pool-of-threads, which I think is related to the MDEV-29843 issue. In MDEV-31427 I suggest --thread-handling=one-thread-per-connection as a possible work-around.

Hope this helps,

 - Kristian.
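For reference, a minimal sketch of the checks and the work-around mentioned above; thread_handling is not a dynamic variable, so changing it means editing the configuration and restarting the server:

    -- Confirm the exact server version and the thread handling mode in use.
    SELECT VERSION();
    SELECT @@global.thread_handling;

    # Possible work-around from MDEV-31427, in my.cnf (or an included .cnf file):
    [mysqld]
    thread_handling = one-thread-per-connection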
participants (3)
- andrei.elkin@pp.inet.fi
- Kristian Nielsen
- ragul rangarajan