Re: [PATCH] MDEV-33551: Semi-sync Wait Point AFTER_COMMIT Slow on Workloads with Heavy Concurrency
From: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
When using semi-sync replication with rpl_semi_sync_master_wait_point=AFTER_COMMIT, the performance of the primary can significantly reduce compared to AFTER_SYNC's performance for workloads with many concurrent users executing transactions. This is because all connections on the primary share
Hi Brandon, Here is my second round review of the MDEV-33551 patch. I think these comments of mine from first round still need to be addressed:
A bit of care must be taken about lifetime of Tranx_node and THD respectively (in case one goes away but not the other). But the whole
+void Repl_semi_sync_master::await_all_slave_replies()
+ wait_result= cond_timewait(front.thd, &abstime);
I'm wondering, is it safe here to wait on front.thd->COND_wakeup_ready? Isn't there a possibility that this might go away during the wait?
I tried a simple testcase (attached) which disconnects the session thread while await_all_slave_replies() is running during shutdown. As I expected, I could see the COND_wakeup_ready of the session thread being de-allocated while await_all_slave_replies() is still waking up from that condition. That cannot be correct. Similarly, it seems clearly possible that a session thread is killed and exits while its entry is still present in the m_active_tranxs list, which will similarly lead to signalling a de-allocated COND_wakeup_ready. When a thread aborts the wait before the entry is removed, I think it should simply look up its entry (can use the m_trx_htb hash table, or even just iterate the list) and set its THD to NULL (or remove the entry, whatever is simpler). For the await_all_slave_replies() problem, I think the whole wait-at-shutdown is much too complex as it is now. The whole is_thd_awaiting_semisync_ack(), n_threads_awaiting_ack, special-case not killing threads waiting for semi-sync-ack, and now introducing unmark_thd_awaiting_ack. And it still has the bug waiting on a COND_wakeup_ready of another THD that can disappear during the wait. I think all of that should just go away, the wait in await_all_slave_replies() should be done much simpler with a dedicated condition variable to wait for. Eg. await_all_slave_replies() just does a mysql_cond_wait(COND_binlog_send) under LOCK_binlog until the last binlog position is ack'ed, and this condition is then signalled from report_reply_binlog() (either always, or await_all_slave_replies() can set a flag to request signal). Or have the shutdown thread enqueue its THD at the end of the m_active_tranxs list. Let's avoid all this complex interaction between SHUTDOWN WAIT FOR ALL SLAVES and session threads. BTW, COND_binlog_send is currently unused in your patch. Below detailed comments on individual parts of the patch:
--- a/sql/mysqld.cc +++ b/sql/mysqld.cc @@ -1750,18 +1750,12 @@ static void close_connections(void)
- if (shutdown_wait_for_slaves && repl_semisync_master.get_master_enabled()) + if (shutdown_wait_for_slaves && repl_semisync_master.get_master_enabled() && + repl_semisync_master.sync_get_master_wait_sessions())
Do you need to check sync_get_master_wait_sessions() > 0 here? Why not wait unconditionally for the last binlog position sent?
--- a/sql/semisync_master.cc +++ b/sql/semisync_master.cc @@ -68,6 +68,14 @@ static ulonglong timespec_to_usec(const struct timespec *ts) return (ulonglong) ts->tv_sec * TIME_MILLION + ts->tv_nsec / TIME_THOUSAND; }
+int signal_waiting_transaction(THD *waiting_thd, const char *binlog_file, + my_off_t binlog_pos) +{ + if (waiting_thd->is_awaiting_semisync_ack) + mysql_cond_signal(&waiting_thd->COND_wakeup_ready);
The check on is_awaiting_semi_sync_ack seems wrong here. Normally, the THD will be waiting, otherwise it would not be in the list in the first place, so faster with an occasional redundant signal than a conditional always. And if the THD is not waiting, it's not safe to access it anyway. Instead, as mentioned above, have a check for thd != NULL here, and let a non-waiting thread clear the THD in its entry before it releases the LOCK_binlog.
-int Repl_semi_sync_master::cond_timewait(struct timespec *wait_time) +int Repl_semi_sync_master::cond_timewait(THD *thd, struct timespec *_wait_time)
+ if (_wait_time == NULL) + { + create_timeout(&gen_wait_time, NULL); + real_wait_time= &gen_wait_time; + } + else + { + real_wait_time= _wait_time; + }
Don't pass in booleans that are always hard-coded true or false in the caller. Instead create two different functions, one that takes a time_spec, and one that does not (the latter creating it and calling the former).
@@ -533,7 +559,8 @@ void Repl_semi_sync_master::remove_slave() Signal transactions waiting in commit_trx() that they do not have to wait anymore. */ - cond_broadcast(); + m_active_tranxs->clear_active_tranx_nodes(NULL, NULL, + signal_waiting_transaction);
This does not compile on my machine (the second NULL is passed to my_off_t position), is that a last-minute change that was never tested? Or NULL is defined differently on your machine, so it doesn't fail to compile? BTW, question about the old code: Why didn't the old code clear the m_active_tranxs list? Was it a bug? Or was it cleared elsewhere before (and if so where)?
-void Repl_semi_sync_master::await_slave_reply() +void Repl_semi_sync_master::await_all_slave_replies()
+ The lock is taken per iteration rather than over the whole loop to allow + more opportunity for the user threads to handle their ACKs.
That sounds like a mis-understanding. The lock is always released by mysql_cond_timedwait() in each iteration of the loop, no need to release/take it again for just a couple instructions.
+ while (TRUE) + { + lock(); + if (m_active_tranxs->is_empty() || !get_master_enabled() || !is_on()) + { + unlock(); + break; + } + wait_result= m_active_tranxs->for_front( + [](THD *thd, const char *binlog_file, my_off_t binlog_pos) -> int { + return (thd->is_awaiting_semisync_ack) + ? repl_semisync_master.cond_timewait(thd, NULL) + : 0; + });
As I wrote at the top, it is not safe to wait on the COND_wakeup_ready of another THD, since that THD might go away during the wait. Instead, let's do a specific wait for the end position here, either on COND_binlog or by enqueing an extra THD at the end. (And btw, in your code, why wait for the _front_ of the list and not the _rear_?)
diff --git a/sql/semisync_master.h b/sql/semisync_master.h index 99f46869354..1eb690b572b 100644 --- a/sql/semisync_master.h +++ b/sql/semisync_master.h
+ * The pre_delete_hook parameter is a function pointer that will be invoked + * for each Active_tranx node, in order, from m_trx_front to log_file_name, + * e.g. to signal their wakeup condition. Repl_semi_sync_binlog::LOCK_binlog + * is held while this is invoked.
"from m_trx_front to log_file_name", I don't understand? (Typo s/log_file_name/m_trx_rear/ ?)
+/* + Element in Repl_semi_sync_master::wait_queue to preserve the state of a + transaction waiting for an ACK. +*/ +typedef struct _semisync_wait_trx { + const char *binlog_name; + my_off_t binlog_pos; + THD *thd; +} semisync_wait_trx_t; +
This is not longer used (should be removed).
* Input: (the transaction events' ending binlog position) * trx_wait_binlog_name - (IN) ending position's file name * trx_wait_binlog_pos - (IN) ending position's file offset + * unmark_thd_awaiting_ack - (IN) Whether or not to unmark + * thd->is_awaiting_semisync_ack after receiving + * an ACK from the replica. If using wait_point + * AFTER_SYNC with binlog_group_commit, only the + * last transaction written to binlog in the + * group should negate the boolean, because the + * same thread (i.e. leader thread) will wait for + * all transaction ACKs. Negating it too early + * would break SHUTDOWN WAIT FOR ALL SLAVES, as + * that is the condition tested to bypass killing + * threads in phase 1. In all other situations, + * the boolean should be negated immediately.
That seems unnecessarily convoluted. I think the real problem here is that write_tranx_in_binlog() sets the is_awaiting_semi_sync_ack flag on the wrong THD, on the waiter_thd instead of the trx_thd. But once we fix the await_all_slave_replies(), do we even need the is_awaiting_semi_sync_ack flag any more? If so, why? If not, then this point becomes moot anyway. - Kristian. --source include/have_innodb.inc --source include/have_debug.inc --source include/have_binlog_format_mixed.inc --source include/master-slave.inc --connection master CREATE TABLE t1 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB; INSERT INTO t1 VALUES (1,0); --sync_slave_with_master source include/stop_slave.inc; SET @old_semi_enabled= @@GLOBAL.rpl_semi_sync_slave_enabled; SET GLOBAL rpl_semi_sync_slave_enabled = 1; --source include/start_slave.inc --connection master SET GLOBAL rpl_semi_sync_master_enabled= 1; SET GLOBAL rpl_semi_sync_master_timeout= 5000; let $status_var= Rpl_semi_sync_master_clients; let $status_var_value= 1; source include/wait_for_status_var.inc; show status like 'Rpl_semi_sync_master_status'; show status like 'Rpl_semi_sync_master_clients'; --connection slave SET @old_dbug= @@GLOBAL.debug_dbug; SET GLOBAL debug_dbug="+d,simulate_delay_semisync_slave_reply"; --connect(con1, localhost, root,,) --connect(con2, localhost, root,,) --connection con1 send INSERT INTO t1 VALUES (2,0); --sleep 0.3 --disconnect con1 --connection default --write_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect wait EOF #let $status_var= Rpl_semi_sync_master_wait_sessions; #let $status_var_value= 1; #source include/wait_for_status_var.inc; --sleep 0.1 --connection con2 SHUTDOWN WAIT FOR ALL SLAVES; --disconnect con2 --connection default --source include/wait_until_disconnected.inc --connection master --source include/wait_until_disconnected.inc --connection server_1 --source include/wait_until_disconnected.inc --connection default --append_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect restart EOF --enable_reconnect --source include/wait_until_connected_again.inc --connection master --enable_reconnect --source include/wait_until_connected_again.inc --connection server_1 --enable_reconnect --source include/wait_until_connected_again.inc --save_master_pos --connection slave SET GLOBAL debug_dbug= @old_dbug; SET GLOBAL rpl_semi_sync_slave_enabled= @old_semi_enabled; --source include/stop_slave.inc --source include/start_slave.inc --sync_with_master --connection default DROP TABLE t1; --source include/rpl_end.inc
Thanks for the review, Kristian! I agree with your summary, see my replies to your specific points below. On Tue, Mar 12, 2024 at 5:29 AM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
From: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
When using semi-sync replication with rpl_semi_sync_master_wait_point=AFTER_COMMIT, the performance of the primary can significantly reduce compared to AFTER_SYNC's performance for workloads with many concurrent users executing transactions. This is because all connections on the primary share
Hi Brandon,
Here is my second round review of the MDEV-33551 patch.
I think these comments of mine from first round still need to be addressed:
A bit of care must be taken about lifetime of Tranx_node and THD respectively (in case one goes away but not the other). But the whole
+void Repl_semi_sync_master::await_all_slave_replies()
+ wait_result= cond_timewait(front.thd, &abstime);
I'm wondering, is it safe here to wait on front.thd->COND_wakeup_ready? Isn't there a possibility that this might go away during the wait?
I tried a simple testcase (attached) which disconnects the session thread while await_all_slave_replies() is running during shutdown. As I expected, I could see the COND_wakeup_ready of the session thread being de-allocated while await_all_slave_replies() is still waking up from that condition. That cannot be correct.
Similarly, it seems clearly possible that a session thread is killed and exits while its entry is still present in the m_active_tranxs list, which will similarly lead to signalling a de-allocated COND_wakeup_ready.
When a thread aborts the wait before the entry is removed, I think it should simply look up its entry (can use the m_trx_htb hash table, or even just iterate the list) and set its THD to NULL (or remove the entry, whatever is simpler).
For the await_all_slave_replies() problem, I think the whole wait-at-shutdown is much too complex as it is now. The whole is_thd_awaiting_semisync_ack(), n_threads_awaiting_ack, special-case not killing threads waiting for semi-sync-ack, and now introducing unmark_thd_awaiting_ack. And it still has the bug waiting on a COND_wakeup_ready of another THD that can disappear during the wait.
I think all of that should just go away, the wait in await_all_slave_replies() should be done much simpler with a dedicated condition variable to wait for. Eg. await_all_slave_replies() just does a mysql_cond_wait(COND_binlog_send) under LOCK_binlog until the last binlog position is ack'ed, and this condition is then signalled from report_reply_binlog() (either always, or await_all_slave_replies() can set a flag to request signal). Or have the shutdown thread enqueue its THD at the end of the m_active_tranxs list. Let's avoid all this complex interaction between SHUTDOWN WAIT FOR ALL SLAVES and session threads.
BTW, COND_binlog_send is currently unused in your patch.
Below detailed comments on individual parts of the patch:
--- a/sql/mysqld.cc +++ b/sql/mysqld.cc @@ -1750,18 +1750,12 @@ static void close_connections(void)
- if (shutdown_wait_for_slaves && repl_semisync_master.get_master_enabled()) + if (shutdown_wait_for_slaves && repl_semisync_master.get_master_enabled() && + repl_semisync_master.sync_get_master_wait_sessions())
Do you need to check sync_get_master_wait_sessions() > 0 here? Why not wait unconditionally for the last binlog position sent?
Ah this was an original simplification before I changed where set_awaiting_semisync_ack was set, but leaving it so is a bug, as we can have items in the wait list, yet not be waiting on the ACK. Even with the fix, I wonder if we should also move where we increase the wait_sessions variable to be when we enqueue into Active_tranx (currently it is in commit_trx() just before the wait). I suppose it depends on how we want to define a wait session. I.e., when we implement group ACK, how do we want to report the leader actually waiting for ACK, vs non-leaders waiting on the leader's ACK.
--- a/sql/semisync_master.cc +++ b/sql/semisync_master.cc @@ -68,6 +68,14 @@ static ulonglong timespec_to_usec(const struct timespec *ts) return (ulonglong) ts->tv_sec * TIME_MILLION + ts->tv_nsec / TIME_THOUSAND; }
+int signal_waiting_transaction(THD *waiting_thd, const char *binlog_file, + my_off_t binlog_pos) +{ + if (waiting_thd->is_awaiting_semisync_ack) + mysql_cond_signal(&waiting_thd->COND_wakeup_ready);
The check on is_awaiting_semi_sync_ack seems wrong here. Normally, the THD will be waiting, otherwise it would not be in the list in the first place, so faster with an occasional redundant signal than a conditional always. And if the THD is not waiting, it's not safe to access it anyway.
Instead, as mentioned above, have a check for thd != NULL here, and let a non-waiting thread clear the THD in its entry before it releases the LOCK_binlog.
Ah, that makes more sense. Will do.
-int Repl_semi_sync_master::cond_timewait(struct timespec *wait_time) +int Repl_semi_sync_master::cond_timewait(THD *thd, struct timespec *_wait_time)
+ if (_wait_time == NULL) + { + create_timeout(&gen_wait_time, NULL); + real_wait_time= &gen_wait_time; + } + else + { + real_wait_time= _wait_time; + }
Don't pass in booleans that are always hard-coded true or false in the caller. Instead create two different functions, one that takes a time_spec, and one that does not (the latter creating it and calling the former).
Sure.
@@ -533,7 +559,8 @@ void Repl_semi_sync_master::remove_slave() Signal transactions waiting in commit_trx() that they do not have to wait anymore. */ - cond_broadcast(); + m_active_tranxs->clear_active_tranx_nodes(NULL, NULL, + signal_waiting_transaction);
This does not compile on my machine (the second NULL is passed to my_off_t position), is that a last-minute change that was never tested? Or NULL is defined differently on your machine, so it doesn't fail to compile?
Ah, interesting. With clang it compiles, but with GCC it does not. I'll fix it to `0`.
BTW, question about the old code: Why didn't the old code clear the m_active_tranxs list? Was it a bug? Or was it cleared elsewhere before (and if so where)?
It did not clear it, but should have.
-void Repl_semi_sync_master::await_slave_reply() +void Repl_semi_sync_master::await_all_slave_replies()
+ The lock is taken per iteration rather than over the whole loop to allow + more opportunity for the user threads to handle their ACKs.
That sounds like a mis-understanding. The lock is always released by mysql_cond_timedwait() in each iteration of the loop, no need to release/take it again for just a couple instructions.
+ while (TRUE) + { + lock(); + if (m_active_tranxs->is_empty() || !get_master_enabled() || !is_on()) + { + unlock(); + break; + } + wait_result= m_active_tranxs->for_front( + [](THD *thd, const char *binlog_file, my_off_t binlog_pos) -> int { + return (thd->is_awaiting_semisync_ack) + ? repl_semisync_master.cond_timewait(thd, NULL) + : 0; + });
As I wrote at the top, it is not safe to wait on the COND_wakeup_ready of another THD, since that THD might go away during the wait. Instead, let's do a specific wait for the end position here, either on COND_binlog or by enqueing an extra THD at the end.
Sure, I'll keep COND_binlog around and use that (the main thread doesn't have a THD, and creating one just to enqueue into the list would be more overhead than keeping COND_binlog).
(And btw, in your code, why wait for the _front_ of the list and not the _rear_?)
Ah, that was just keeping the original logic from pre-patch (to be somewhat consistent). It would have made sense to optimize to just be the rear, with the new logic.
diff --git a/sql/semisync_master.h b/sql/semisync_master.h index 99f46869354..1eb690b572b 100644 --- a/sql/semisync_master.h +++ b/sql/semisync_master.h
+ * The pre_delete_hook parameter is a function pointer that will be invoked + * for each Active_tranx node, in order, from m_trx_front to log_file_name, + * e.g. to signal their wakeup condition. Repl_semi_sync_binlog::LOCK_binlog + * is held while this is invoked.
"from m_trx_front to log_file_name", I don't understand? (Typo s/log_file_name/m_trx_rear/ ?)
Huh. Big typo, my intelli-"sense" must have filled it in.
+/* + Element in Repl_semi_sync_master::wait_queue to preserve the state of a + transaction waiting for an ACK. +*/ +typedef struct _semisync_wait_trx { + const char *binlog_name; + my_off_t binlog_pos; + THD *thd; +} semisync_wait_trx_t; +
This is not longer used (should be removed).
Right, thanks.
* Input: (the transaction events' ending binlog position) * trx_wait_binlog_name - (IN) ending position's file name * trx_wait_binlog_pos - (IN) ending position's file offset + * unmark_thd_awaiting_ack - (IN) Whether or not to unmark + * thd->is_awaiting_semisync_ack after
+ * an ACK from the replica. If using wait_point + * AFTER_SYNC with binlog_group_commit, only the + * last transaction written to binlog in
receiving the
+ * group should negate the boolean, because the + * same thread (i.e. leader thread) will wait for + * all transaction ACKs. Negating it too early + * would break SHUTDOWN WAIT FOR ALL SLAVES, as + * that is the condition tested to bypass killing + * threads in phase 1. In all other situations, + * the boolean should be negated immediately.
That seems unnecessarily convoluted. I think the real problem here is that write_tranx_in_binlog() sets the is_awaiting_semi_sync_ack flag on the wrong THD, on the waiter_thd instead of the trx_thd.
But once we fix the await_all_slave_replies(), do we even need the is_awaiting_semi_sync_ack flag any more? If so, why? If not, then this point becomes moot anyway.
Ah, right, this is actually a bug as is, as when we are in kill_threads_phase_1 we test for is_awaiting_semi_sync_ack, which would only be set on the leader binlog thread.. So then in kill_threads_phase_1, we'd just check if the THD is in Active_tranx to see if we should skip the kill.
- Kristian.
In summary 1) Fix close_connections() check from wait_sessions to instead, always wait for the last element in the queue 2) Fix aborted threads to set their THD to null (with respective NULL thd checks, where appropriate) 3) Fix await_all_slave_replies() to use COND_binlog (signalled on last slave reply) 4) Remove THD::is_awaiting_semisync_ack, change kill_threads_phase_1 to check if THD is in Active_tranx. 5) Fix comment typo s/log_file_name/m_trx_rear/ 6) Remove struct _semisync_wait_trx 7) Split cond_timewait into two functions, one with a time_spec param, and one without 8) Fix clear_active_tranx_nodes NULL param Thanks again, Brandon
Brandon Nesterenko <brandon.nesterenko@mariadb.com> writes:
Sure, I'll keep COND_binlog around and use that (the main thread doesn't have a THD, and creating one just to enqueue into the list would be more overhead than keeping COND_binlog).
Agree, no need to create another THD.
But once we fix the await_all_slave_replies(), do we even need the is_awaiting_semi_sync_ack flag any more? If so, why? If not, then this point becomes moot anyway.
Ah, right, this is actually a bug as is, as when we are in kill_threads_phase_1 we test for is_awaiting_semi_sync_ack, which would only be set on the leader binlog thread.. So then in kill_threads_phase_1, we'd just check if the THD is in Active_tranx to see if we should skip the kill.
I still don't understand why we want to skip the kill? I think the point of SHUTDOWN WAIT FOR ALL SLAVES is to ensure that slaves have time to get all transactions that were binlogged on the master, before it shuts down. This will be ensured by await_all_slave_replies(), which I think can be done without relying on any other session/THD waiting for specific ACKs. In fact, any session could in principle be killed anyway (by the user), and we would still need await_all_slave_replies() to work, right? Or a session could be just before or after waiting for its ACK, and then it will be killed anyway. So I don't understand why the shutdown should skip killing these specific sessions in this specific case? On the other hand, I'm not sure this is that important, can't really think of any harm in skipping the kill either. I'm mostly trying to understand the reason for doing this. - Kristian.
On Fri, Mar 15, 2024 at 12:22 PM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
Brandon Nesterenko <brandon.nesterenko@mariadb.com> writes:
But once we fix the await_all_slave_replies(), do we even need the is_awaiting_semi_sync_ack flag any more? If so, why? If not, then this point becomes moot anyway.
Ah, right, this is actually a bug as is, as when we are in kill_threads_phase_1 we test for is_awaiting_semi_sync_ack, which would only be set on the leader binlog thread.. So then in kill_threads_phase_1, we'd just check if the THD is in Active_tranx to see if we should skip the kill.
I still don't understand why we want to skip the kill?
I think the point of SHUTDOWN WAIT FOR ALL SLAVES is to ensure that slaves have time to get all transactions that were binlogged on the master, before it shuts down. This will be ensured by await_all_slave_replies(), which I think can be done without relying on any other session/THD waiting for specific ACKs.
In fact, any session could in principle be killed anyway (by the user), and we would still need await_all_slave_replies() to work, right? Or a session could be just before or after waiting for its ACK, and then it will be killed anyway.
So I don't understand why the shutdown should skip killing these specific sessions in this specific case?
On the other hand, I'm not sure this is that important, can't really think of any harm in skipping the kill either. I'm mostly trying to understand the reason for doing this.
I see your point. Though likely not too important, it adds complexity to the code, so I've reverted that part (originally added by MDEV-11853). The PR (PR 3089 / branch 10.6-MDEV-33551) has been updated per your latest review. Brandon
participants (2)
-
Brandon Nesterenko
-
Kristian Nielsen