[MariaDB developers] Re: [PATCH] MDEV-33551: Semi-sync Wait Point AFTER_COMMIT Slow on Workloads with Heavy Concurrency

12 Mar 2024

...
From: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
When using semi-sync replication with
rpl_semi_sync_master_wait_point=AFTER_COMMIT, the performance of the
primary can significantly reduce compared to AFTER_SYNC's
performance for workloads with many concurrent users executing
transactions. This is because all connections on the primary share
Hi Brandon,

Here is my second round review of the MDEV-33551 patch.

I think these comments of mine from first round still need to be addressed:
...
...
A bit of care must be taken about lifetime of Tranx_node and THD
respectively (in case one goes away but not the other). But the whole
...
...
...
+void Repl_semi_sync_master::await_all_slave_replies()
...
...
...
+    wait_result= cond_timewait(front.thd, &abstime);
I'm wondering, is it safe here to wait on front.thd->COND_wakeup_ready?
Isn't there a possibility that this might go away during the wait?
I tried a simple testcase (attached) which disconnects the session thread
while await_all_slave_replies() is running during shutdown. As I expected, I
could see the COND_wakeup_ready of the session thread being de-allocated
while await_all_slave_replies() is still waking up from that condition. That
cannot be correct.

Similarly, it seems clearly possible that a session thread is killed and
exits while its entry is still present in the m_active_tranxs list, which
will similarly lead to signalling a de-allocated COND_wakeup_ready.

When a thread aborts the wait before the entry is removed, I think it should
simply look up its entry (can use the m_trx_htb hash table, or even just
iterate the list) and set its THD to NULL (or remove the entry, whatever is
simpler).

For the await_all_slave_replies() problem, I think the whole
wait-at-shutdown is much too complex as it is now. The whole
is_thd_awaiting_semisync_ack(), n_threads_awaiting_ack, special-case not
killing threads waiting for semi-sync-ack, and now introducing
unmark_thd_awaiting_ack. And it still has the bug waiting on a
COND_wakeup_ready of another THD that can disappear during the wait.

I think all of that should just go away, the wait in
await_all_slave_replies() should be done much simpler with a dedicated
condition variable to wait for. Eg. await_all_slave_replies() just does a
mysql_cond_wait(COND_binlog_send) under LOCK_binlog until the last binlog
position is ack'ed, and this condition is then signalled from
report_reply_binlog() (either always, or await_all_slave_replies() can set a
flag to request signal). Or have the shutdown thread enqueue its THD at the
end of the m_active_tranxs list. Let's avoid all this complex interaction
between SHUTDOWN WAIT FOR ALL SLAVES and session threads.

BTW, COND_binlog_send is currently unused in your patch.

Below detailed comments on individual parts of the patch:
...
--- a/sql/mysqld.cc
+++ b/sql/mysqld.cc
@@ -1750,18 +1750,12 @@ static void close_connections(void)
...
-  if (shutdown_wait_for_slaves && repl_semisync_master.get_master_enabled())
+  if (shutdown_wait_for_slaves && repl_semisync_master.get_master_enabled() &&
+      repl_semisync_master.sync_get_master_wait_sessions())
Do you need to check sync_get_master_wait_sessions() > 0 here? Why not wait
unconditionally for the last binlog position sent?
...
--- a/sql/semisync_master.cc
+++ b/sql/semisync_master.cc
@@ -68,6 +68,14 @@ static ulonglong timespec_to_usec(const struct timespec *ts)
   return (ulonglong) ts->tv_sec * TIME_MILLION + ts->tv_nsec / TIME_THOUSAND;
 }
+int signal_waiting_transaction(THD *waiting_thd, const char *binlog_file,
+                                my_off_t binlog_pos)
+{
+  if (waiting_thd->is_awaiting_semisync_ack)
+    mysql_cond_signal(&waiting_thd->COND_wakeup_ready);
The check on is_awaiting_semi_sync_ack seems wrong here. Normally, the THD
will be waiting, otherwise it would not be in the list in the first place,
so faster with an occasional redundant signal than a conditional always. And
if the THD is not waiting, it's not safe to access it anyway.

Instead, as mentioned above, have a check for thd != NULL here, and let a
non-waiting thread clear the THD in its entry before it releases the
LOCK_binlog.
...
-int Repl_semi_sync_master::cond_timewait(struct timespec *wait_time)
+int Repl_semi_sync_master::cond_timewait(THD *thd, struct timespec *_wait_time)
...
+  if (_wait_time == NULL)
+  {
+    create_timeout(&gen_wait_time, NULL);
+    real_wait_time= &gen_wait_time;
+  }
+  else
+  {
+    real_wait_time= _wait_time;
+  }
Don't pass in booleans that are always hard-coded true or false in the
caller. Instead create two different functions, one that takes a time_spec,
and one that does not (the latter creating it and calling the former).
...
@@ -533,7 +559,8 @@ void Repl_semi_sync_master::remove_slave()
       Signal transactions waiting in commit_trx() that they do not have to
       wait anymore.
     */
-    cond_broadcast();
+    m_active_tranxs->clear_active_tranx_nodes(NULL, NULL,
+                                              signal_waiting_transaction);
This does not compile on my machine (the second NULL is passed to my_off_t
position), is that a last-minute change that was never tested? Or NULL is
defined differently on your machine, so it doesn't fail to compile?

BTW, question about the old code: Why didn't the old code clear the
m_active_tranxs list? Was it a bug? Or was it cleared elsewhere before (and
if so where)?
...
-void Repl_semi_sync_master::await_slave_reply()
+void Repl_semi_sync_master::await_all_slave_replies()
...
+    The lock is taken per iteration rather than over the whole loop to allow
+    more opportunity for the user threads to handle their ACKs.
That sounds like a mis-understanding. The lock is always released by
mysql_cond_timedwait() in each iteration of the loop, no need to
release/take it again for just a couple instructions.
...
+  while (TRUE)
+  {
+    lock();
+    if (m_active_tranxs->is_empty() || !get_master_enabled() || !is_on())
+    {
+      unlock();
+      break;
+    }
+    wait_result= m_active_tranxs->for_front(
+        [](THD *thd, const char *binlog_file, my_off_t binlog_pos) -> int {
+          return (thd->is_awaiting_semisync_ack)
+                     ? repl_semisync_master.cond_timewait(thd, NULL)
+                     : 0;
+        });
As I wrote at the top, it is not safe to wait on the COND_wakeup_ready of
another THD, since that THD might go away during the wait. Instead, let's do
a specific wait for the end position here, either on COND_binlog or by
enqueing an extra THD at the end.

(And btw, in your code, why wait for the _front_ of the list and not the
_rear_?)
...

diff --git a/sql/semisync_master.h b/sql/semisync_master.h
index 99f46869354..1eb690b572b 100644
--- a/sql/semisync_master.h
+++ b/sql/semisync_master.h
+   * The pre_delete_hook parameter is a function pointer that will be invoked
+   * for each Active_tranx node, in order, from m_trx_front to log_file_name,
+   * e.g. to signal their wakeup condition. Repl_semi_sync_binlog::LOCK_binlog
+   * is held while this is invoked.
"from m_trx_front to log_file_name", I don't understand?
(Typo s/log_file_name/m_trx_rear/ ?)
...
+/*
+  Element in Repl_semi_sync_master::wait_queue to preserve the state of a
+  transaction waiting for an ACK.
+*/
+typedef struct _semisync_wait_trx {
+  const char *binlog_name;
+  my_off_t binlog_pos;
+  THD *thd;
+} semisync_wait_trx_t;
+
This is not longer used (should be removed).
...
* Input:  (the transaction events' ending binlog position)
    *  trx_wait_binlog_name -     (IN)  ending position's file name
    *  trx_wait_binlog_pos  -     (IN)  ending position's file offset
+   *  unmark_thd_awaiting_ack  - (IN)  Whether or not to unmark
+   *                             thd->is_awaiting_semisync_ack after receiving
+   *                             an ACK from the replica. If using wait_point
+   *                             AFTER_SYNC with binlog_group_commit, only the
+   *                             last transaction written to binlog in the
+   *                             group should negate the boolean, because the
+   *                             same thread (i.e. leader thread) will wait for
+   *                             all transaction ACKs. Negating it too early
+   *                             would break SHUTDOWN WAIT FOR ALL SLAVES, as
+   *                             that is the condition tested to bypass killing
+   *                             threads in phase 1. In all other situations,
+   *                             the boolean should be negated immediately.
That seems unnecessarily convoluted.
I think the real problem here is that write_tranx_in_binlog() sets the
is_awaiting_semi_sync_ack flag on the wrong THD, on the waiter_thd instead
of the trx_thd.

But once we fix the await_all_slave_replies(), do we even need the
is_awaiting_semi_sync_ack flag any more? If so, why? If not, then this point
becomes moot anyway.

 - Kristian.


--source include/have_innodb.inc
--source include/have_debug.inc
--source include/have_binlog_format_mixed.inc
--source include/master-slave.inc

--connection master
CREATE TABLE t1 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1,0);
--sync_slave_with_master

source include/stop_slave.inc;
SET @old_semi_enabled= @@GLOBAL.rpl_semi_sync_slave_enabled;
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
--source include/start_slave.inc

--connection master
SET GLOBAL rpl_semi_sync_master_enabled= 1;
SET GLOBAL rpl_semi_sync_master_timeout= 5000;

let $status_var= Rpl_semi_sync_master_clients;
let $status_var_value= 1;
source include/wait_for_status_var.inc;

show status like 'Rpl_semi_sync_master_status';
show status like 'Rpl_semi_sync_master_clients';


--connection slave
SET @old_dbug= @@GLOBAL.debug_dbug;
SET GLOBAL debug_dbug="+d,simulate_delay_semisync_slave_reply";

--connect(con1, localhost, root,,)
--connect(con2, localhost, root,,)

--connection con1
send INSERT INTO t1 VALUES (2,0);
--sleep 0.3
--disconnect con1

--connection default
--write_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
wait
EOF
#let $status_var= Rpl_semi_sync_master_wait_sessions;
#let $status_var_value= 1;
#source include/wait_for_status_var.inc;
--sleep 0.1

--connection con2
SHUTDOWN WAIT FOR ALL SLAVES;
--disconnect con2

--connection default
--source include/wait_until_disconnected.inc

--connection master
--source include/wait_until_disconnected.inc

--connection server_1
--source include/wait_until_disconnected.inc

--connection default
--append_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
restart
EOF
--enable_reconnect
--source include/wait_until_connected_again.inc

--connection master
--enable_reconnect
--source include/wait_until_connected_again.inc

--connection server_1
--enable_reconnect
--source include/wait_until_connected_again.inc

--save_master_pos
--connection slave
SET GLOBAL debug_dbug= @old_dbug;
SET GLOBAL rpl_semi_sync_slave_enabled= @old_semi_enabled;
--source include/stop_slave.inc
--source include/start_slave.inc
--sync_with_master

--connection default
DROP TABLE t1;
--source include/rpl_end.inc

    

[MariaDB developers] Re: [PATCH] MDEV-33551: Semi-sync Wait Point AFTER_COMMIT Slow on Workloads with Heavy Concurrency

Kristian Nielsen