From: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
When using semi-sync replication with rpl_semi_sync_master_wait_point=AFTER_COMMIT, the performance of the primary can significantly reduce compared to AFTER_SYNC's performance for workloads with many concurrent users executing transactions. This is because all connections on the primary share
Hi Brandon, Here is my second round review of the MDEV-33551 patch. I think these comments of mine from first round still need to be addressed:
A bit of care must be taken about lifetime of Tranx_node and THD respectively (in case one goes away but not the other). But the whole
+void Repl_semi_sync_master::await_all_slave_replies()
+ wait_result= cond_timewait(front.thd, &abstime);
I'm wondering, is it safe here to wait on front.thd->COND_wakeup_ready? Isn't there a possibility that this might go away during the wait?
I tried a simple testcase (attached) which disconnects the session thread while await_all_slave_replies() is running during shutdown. As I expected, I could see the COND_wakeup_ready of the session thread being de-allocated while await_all_slave_replies() is still waking up from that condition. That cannot be correct. Similarly, it seems clearly possible that a session thread is killed and exits while its entry is still present in the m_active_tranxs list, which will similarly lead to signalling a de-allocated COND_wakeup_ready. When a thread aborts the wait before the entry is removed, I think it should simply look up its entry (can use the m_trx_htb hash table, or even just iterate the list) and set its THD to NULL (or remove the entry, whatever is simpler). For the await_all_slave_replies() problem, I think the whole wait-at-shutdown is much too complex as it is now. The whole is_thd_awaiting_semisync_ack(), n_threads_awaiting_ack, special-case not killing threads waiting for semi-sync-ack, and now introducing unmark_thd_awaiting_ack. And it still has the bug waiting on a COND_wakeup_ready of another THD that can disappear during the wait. I think all of that should just go away, the wait in await_all_slave_replies() should be done much simpler with a dedicated condition variable to wait for. Eg. await_all_slave_replies() just does a mysql_cond_wait(COND_binlog_send) under LOCK_binlog until the last binlog position is ack'ed, and this condition is then signalled from report_reply_binlog() (either always, or await_all_slave_replies() can set a flag to request signal). Or have the shutdown thread enqueue its THD at the end of the m_active_tranxs list. Let's avoid all this complex interaction between SHUTDOWN WAIT FOR ALL SLAVES and session threads. BTW, COND_binlog_send is currently unused in your patch. Below detailed comments on individual parts of the patch:
--- a/sql/mysqld.cc +++ b/sql/mysqld.cc @@ -1750,18 +1750,12 @@ static void close_connections(void)
- if (shutdown_wait_for_slaves && repl_semisync_master.get_master_enabled()) + if (shutdown_wait_for_slaves && repl_semisync_master.get_master_enabled() && + repl_semisync_master.sync_get_master_wait_sessions())
Do you need to check sync_get_master_wait_sessions() > 0 here? Why not wait unconditionally for the last binlog position sent?
--- a/sql/semisync_master.cc +++ b/sql/semisync_master.cc @@ -68,6 +68,14 @@ static ulonglong timespec_to_usec(const struct timespec *ts) return (ulonglong) ts->tv_sec * TIME_MILLION + ts->tv_nsec / TIME_THOUSAND; }
+int signal_waiting_transaction(THD *waiting_thd, const char *binlog_file, + my_off_t binlog_pos) +{ + if (waiting_thd->is_awaiting_semisync_ack) + mysql_cond_signal(&waiting_thd->COND_wakeup_ready);
The check on is_awaiting_semi_sync_ack seems wrong here. Normally, the THD will be waiting, otherwise it would not be in the list in the first place, so faster with an occasional redundant signal than a conditional always. And if the THD is not waiting, it's not safe to access it anyway. Instead, as mentioned above, have a check for thd != NULL here, and let a non-waiting thread clear the THD in its entry before it releases the LOCK_binlog.
-int Repl_semi_sync_master::cond_timewait(struct timespec *wait_time) +int Repl_semi_sync_master::cond_timewait(THD *thd, struct timespec *_wait_time)
+ if (_wait_time == NULL) + { + create_timeout(&gen_wait_time, NULL); + real_wait_time= &gen_wait_time; + } + else + { + real_wait_time= _wait_time; + }
Don't pass in booleans that are always hard-coded true or false in the caller. Instead create two different functions, one that takes a time_spec, and one that does not (the latter creating it and calling the former).
@@ -533,7 +559,8 @@ void Repl_semi_sync_master::remove_slave() Signal transactions waiting in commit_trx() that they do not have to wait anymore. */ - cond_broadcast(); + m_active_tranxs->clear_active_tranx_nodes(NULL, NULL, + signal_waiting_transaction);
This does not compile on my machine (the second NULL is passed to my_off_t position), is that a last-minute change that was never tested? Or NULL is defined differently on your machine, so it doesn't fail to compile? BTW, question about the old code: Why didn't the old code clear the m_active_tranxs list? Was it a bug? Or was it cleared elsewhere before (and if so where)?
-void Repl_semi_sync_master::await_slave_reply() +void Repl_semi_sync_master::await_all_slave_replies()
+ The lock is taken per iteration rather than over the whole loop to allow + more opportunity for the user threads to handle their ACKs.
That sounds like a mis-understanding. The lock is always released by mysql_cond_timedwait() in each iteration of the loop, no need to release/take it again for just a couple instructions.
+ while (TRUE) + { + lock(); + if (m_active_tranxs->is_empty() || !get_master_enabled() || !is_on()) + { + unlock(); + break; + } + wait_result= m_active_tranxs->for_front( + [](THD *thd, const char *binlog_file, my_off_t binlog_pos) -> int { + return (thd->is_awaiting_semisync_ack) + ? repl_semisync_master.cond_timewait(thd, NULL) + : 0; + });
As I wrote at the top, it is not safe to wait on the COND_wakeup_ready of another THD, since that THD might go away during the wait. Instead, let's do a specific wait for the end position here, either on COND_binlog or by enqueing an extra THD at the end. (And btw, in your code, why wait for the _front_ of the list and not the _rear_?)
diff --git a/sql/semisync_master.h b/sql/semisync_master.h index 99f46869354..1eb690b572b 100644 --- a/sql/semisync_master.h +++ b/sql/semisync_master.h
+ * The pre_delete_hook parameter is a function pointer that will be invoked + * for each Active_tranx node, in order, from m_trx_front to log_file_name, + * e.g. to signal their wakeup condition. Repl_semi_sync_binlog::LOCK_binlog + * is held while this is invoked.
"from m_trx_front to log_file_name", I don't understand? (Typo s/log_file_name/m_trx_rear/ ?)
+/* + Element in Repl_semi_sync_master::wait_queue to preserve the state of a + transaction waiting for an ACK. +*/ +typedef struct _semisync_wait_trx { + const char *binlog_name; + my_off_t binlog_pos; + THD *thd; +} semisync_wait_trx_t; +
This is not longer used (should be removed).
* Input: (the transaction events' ending binlog position) * trx_wait_binlog_name - (IN) ending position's file name * trx_wait_binlog_pos - (IN) ending position's file offset + * unmark_thd_awaiting_ack - (IN) Whether or not to unmark + * thd->is_awaiting_semisync_ack after receiving + * an ACK from the replica. If using wait_point + * AFTER_SYNC with binlog_group_commit, only the + * last transaction written to binlog in the + * group should negate the boolean, because the + * same thread (i.e. leader thread) will wait for + * all transaction ACKs. Negating it too early + * would break SHUTDOWN WAIT FOR ALL SLAVES, as + * that is the condition tested to bypass killing + * threads in phase 1. In all other situations, + * the boolean should be negated immediately.
That seems unnecessarily convoluted. I think the real problem here is that write_tranx_in_binlog() sets the is_awaiting_semi_sync_ack flag on the wrong THD, on the waiter_thd instead of the trx_thd. But once we fix the await_all_slave_replies(), do we even need the is_awaiting_semi_sync_ack flag any more? If so, why? If not, then this point becomes moot anyway. - Kristian. --source include/have_innodb.inc --source include/have_debug.inc --source include/have_binlog_format_mixed.inc --source include/master-slave.inc --connection master CREATE TABLE t1 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB; INSERT INTO t1 VALUES (1,0); --sync_slave_with_master source include/stop_slave.inc; SET @old_semi_enabled= @@GLOBAL.rpl_semi_sync_slave_enabled; SET GLOBAL rpl_semi_sync_slave_enabled = 1; --source include/start_slave.inc --connection master SET GLOBAL rpl_semi_sync_master_enabled= 1; SET GLOBAL rpl_semi_sync_master_timeout= 5000; let $status_var= Rpl_semi_sync_master_clients; let $status_var_value= 1; source include/wait_for_status_var.inc; show status like 'Rpl_semi_sync_master_status'; show status like 'Rpl_semi_sync_master_clients'; --connection slave SET @old_dbug= @@GLOBAL.debug_dbug; SET GLOBAL debug_dbug="+d,simulate_delay_semisync_slave_reply"; --connect(con1, localhost, root,,) --connect(con2, localhost, root,,) --connection con1 send INSERT INTO t1 VALUES (2,0); --sleep 0.3 --disconnect con1 --connection default --write_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect wait EOF #let $status_var= Rpl_semi_sync_master_wait_sessions; #let $status_var_value= 1; #source include/wait_for_status_var.inc; --sleep 0.1 --connection con2 SHUTDOWN WAIT FOR ALL SLAVES; --disconnect con2 --connection default --source include/wait_until_disconnected.inc --connection master --source include/wait_until_disconnected.inc --connection server_1 --source include/wait_until_disconnected.inc --connection default --append_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect restart EOF --enable_reconnect --source include/wait_until_connected_again.inc --connection master --enable_reconnect --source include/wait_until_connected_again.inc --connection server_1 --enable_reconnect --source include/wait_until_connected_again.inc --save_master_pos --connection slave SET GLOBAL debug_dbug= @old_dbug; SET GLOBAL rpl_semi_sync_slave_enabled= @old_semi_enabled; --source include/stop_slave.inc --source include/start_slave.inc --sync_with_master --connection default DROP TABLE t1; --source include/rpl_end.inc