[MariaDB developers] Re: [PATCH 2/2] MDEV-33551: Semi-sync Wait Point AFTER_COMMIT Slow on Workloads with Heavy Concurrency

5 Mar 2024

...
commit c51636f254703602f6f6e2e4a260e607e737b9c1 (origin/10.6-MDEV-33551)
Author: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
Date:   Mon Feb 26 07:01:47 2024 -0700
MDEV-33551: Semi-sync Wait Point AFTER_COMMIT Slow on Workloads with Heavy Concurrency
First some overall comments.

The approach of waking up each thread on its own condition seems good to me.
Monty had a different suggestion, which is to have only a single ack sent
per group commit; this could also be a possibility, but it seems orthogonal
to your fix (ie. both could make sense). Monty also had the idea to move the
AFTER_COMMIT sync point into the group commit code, though that changes the
semantics of AFTER_COMMIT somewhat, especially with
--innodb-flush-log-at-trx-commit=3 (or non-InnoDB storage engine). So in
this review I'll focus on keeping your overall approach the same.

Your patch introduces a new wait_queue with the list of waiting threads. But
there is already a very similar queue, m_active_tranxs. It looks to me as if
the m_active_tranx could easily be extended so it can be used for this, and
avoid introducing another queue. Your wait_queue is pushed in
write_tranx_in_binlog() which already does
m_active_tranxs->insert_tranx_node(), and it is popped in
report_reply_binlog() which also does
m_active_tranxs->clear_active_tranx_nodes().

So add some info to struct Tranx_node, maybe a THD * is all that is needed?
A bit of care must be taken about lifetime of Tranx_node and THD
respectively (in case one goes away but not the other). But the whole
m_active_tranxs is already protected by a lock, and there is a hash for
looking up specific position, so I think a THD that wants to abort its wait
can simply clear the THD * from its Tranx_node or something like that to
ensure the semisync ack will not try to wake up a THD that no longer exists.

You'll need to see how this will affect the code for shutdown
(await_all_slave_replies()) where it needs to wait for all acks (IIUC). But
I hope we can be pragmatic here and find something simple that works.

Another general thing related to this:
...
3) Repl_semi_sync_master::commit_trx() no longer loops to await
        its specific ACK. It waits once, and will either fail from
        timeout, or receive its ACK.
We still need the loop to wait. From `man pthread_cond_wait`:

  "When using condition variables there is always a Boolean predicate  in‐
   volving  shared  variables  associated with each condition wait that is
   true  if  the  thread  should  proceed.  Spurious  wakeups   from   the
   pthread_cond_timedwait()  or  pthread_cond_wait()  functions may occur.
   Since the return from pthread_cond_timedwait()  or  pthread_cond_wait()
   does  not  imply anything about the value of this predicate, the predi‐
   cate should be re-evaluated upon such return."

This affects the code in the patch in some places, detailed comments below.
...
2) The time when thd::is_awaiting_semi_sync_ack is set is moved
        to at binlogging time, to ensure transactions which have been
        binlogged and queued up to await an ACK are not killed,
        and are still waited on.
I don't understand this. Why shouldn't a session waiting for ACK be
killable? On the contrary, such a wait can take potentially a long time,
seems important that it can be killed?

I don't fully understand the existing code for this, maybe the patch is
correct and I'm just looking for an explanation why it is correct like this.
Or maybe this is a limitation of the existing code, and your patch doesn't
change it (so having the ability to kill a waiting thread could be added in
a different patch)?
...

diff --git a/mysql-test/suite/rpl/t/rpl_semi_sync_cond_var_per_thd.test b/mysql-test/suite/rpl/t/rpl_semi_sync_cond_var_per_thd.test
index f8fa0a99d9c..ebe15a4ca32 100644
--- a/mysql-test/suite/rpl/t/rpl_semi_sync_cond_var_per_thd.test
+++ b/mysql-test/suite/rpl/t/rpl_semi_sync_cond_var_per_thd.test
@@ -22,6 +22,7 @@
 --source include/master-slave.inc
--connection master
+call mtr.add_suppression("Got an error reading communication packets");
Is this suppression needed? If so, why?
...
diff --git a/sql/mysqld.cc b/sql/mysqld.cc
index e224871795e..b315edc091c 100644
--- a/sql/mysqld.cc
+++ b/sql/mysqld.cc
@@ -1750,18 +1750,12 @@ static void close_connections(void)
   /*
     If we are waiting on any ACKs, delay killing the thread until either an ACK
     is received or the timeout is hit.
...
-    while (waiting_threads-- > 0)
-      repl_semisync_master.await_slave_reply();
+    repl_semisync_master.await_all_slave_replies();
Just curious here, why did you change this code? Was it just because now
there is no general COND_binlog_send to wait on? Or was it fix a bug not
directly related to the main issue?

I'm asking because if it makes things simpler, I think it's fine to keep the
COND_binlog_send and signal it in addition to the individual thread-specific
condition (in fact your patch doesn't seem to remove it though it looks as
if it's no longer used).

On the other hand, this old code in close_connections() that simply calls
await_slave_reply() N times (N = sync_get_master_wait_sessions()) looks
quite wrong to me, looping for a specific number of times makes no sense. So
fixing it is good in any case.
...
diff --git a/sql/semisync_master.cc b/sql/semisync_master.cc
index 0eaf0f0e0e2..f421e924a61 100644
--- a/sql/semisync_master.cc
+++ b/sql/semisync_master.cc
@@ -502,16 +502,33 @@ void Repl_semi_sync_master::unlock()
void Repl_semi_sync_master::cond_broadcast()
 {
-  mysql_cond_broadcast(&COND_binlog_send);
+  while (!wait_queue.empty())
+  {
+    semisync_wait_trx_t next_waiter= wait_queue.front();
Maybe this function should now have a better name than cond_broadcast()
(wakeup_ready_tranx() or something?). But this code will probably change in
any case if the wait_queue is folded into m_active_tranxs.
...
-int Repl_semi_sync_master::cond_timewait(struct timespec *wait_time)
+int Repl_semi_sync_master::cond_timewait(THD *thd, struct timespec *wait_time)
I think this function now makes little sense, clearer just to call directly
mysql_cond_wait(thd->COND_wakeup_ready) at the caller, what do you think?
...
@@ -695,12 +712,36 @@ int Repl_semi_sync_master::report_reply_binlog(uint32 server_id,
...
+    lock();
+    while (!wait_queue.empty())
+    {
+      semisync_wait_trx_t next_waiter= wait_queue.front();
-    cond_broadcast();
+      cmp= Active_tranx::compare(m_reply_file_name, m_reply_file_pos,
+                                 next_waiter.binlog_name, next_waiter.binlog_pos);
+      if (cmp >= 0)
+      {
+        DBUG_PRINT("semisync", ("%s: signal thread %llu.",
+                                "Repl_semi_sync_master::report_reply_binlog",
+                                next_waiter.thd->thread_id));
+        mysql_cond_signal(&next_waiter.thd->COND_wakeup_ready);
+        wait_queue.pop();
This code is very similar to Active_tranx::clear_active_tranx_nodes(), so
hopefully the two can be integrated and wait_queue becomes unnecessary.
...
@@ -779,7 +821,20 @@ int Repl_semi_sync_master::report_binlog_update(THD* thd, const char *log_file,
     strcpy(log_info->log_file, log_file + dirname_length(log_file));
     log_info->log_pos = log_pos;
-    return write_tranx_in_binlog(log_info->log_file, log_pos);
+    /*
+      THD arg depends on wait point mode. If after storage engine commit, the
+      individual connection threads will perform the wait for semi-sync ACKt,
+      thd is the thread of the user connection thread.
+      If it is after binlog sync, the binlog leader thread will perform the
+      semi-sync waits on behalf of the grouped transaction (which at this
+      point, we (current_thd) are the leader). If using binlog_group_commit,
+      thd is the thread of the user connection thread.
+    */
+    thd_to_cond_wait= ((wait_point() == SEMI_SYNC_MASTER_WAIT_POINT_AFTER_STORAGE_COMMIT)
+             ? thd
+             : current_thd);
This condition is complex. It's good that you added the detailed comment
explaining what's going on, but I think it would be better to avoid it
completely. Why not just pass in the value of thd_to_cond_wait explicitly
from the caller?

As an aside, we have this code in trx_group_commit_leader():

  for (current= queue; current != NULL; current= current->next) {
      current->error=
	repl_semisync_master.wait_after_sync(current->cache_mngr->
					     last_commit_pos_file,
					     current->cache_mngr->
					     last_commit_pos_offset);

It seems silly to wait for every binlog position to be acked her one by one,
why not just wait for the last one?

  repl_semisync_master.wait_after_sync(last_in_queue->cache_mngr->last_commit_pos_file, ...);

But while related, it's perhaps not directly part of the problem you're
addressing in your patch, so up to you if you want to include this or leave
it for a possible later patch.
...
@@ -852,12 +907,11 @@ int Repl_semi_sync_master::commit_trx(const char* trx_wait_binlog_name,
...
-    if (!get_master_enabled() || !is_on())
+    if (!get_master_enabled() || !is_on() || thd_killed(thd))
       goto l_end;
...
-    while (is_on() && !thd_killed(thd))
-    {
-      /* We have to check these again as things may have changed */
-      if (!rpl_semi_sync_master_clients && !rpl_semi_sync_master_wait_no_slave)
-      {
-        aborted= 1;
-        break;
-      }
As described earlier, because of the (at least theoretical) possibility of
spurious wakeup, we should keep the while loop here and the checks.
...
@@ -1253,6 +1294,8 @@ int Repl_semi_sync_master::write_tranx_in_binlog(const char* log_file_name,
     }
     else
     {
+      wait_queue.push({log_file_name, log_file_pos, thd});
Hopefully this can be instead folded into
m_active_tranxs->insert_tranx_node() which is called just a bit below.
...
-void Repl_semi_sync_master::await_slave_reply()
+void Repl_semi_sync_master::await_all_slave_replies()
...
+  while (TRUE)
+  {
+    lock();
+    if (wait_queue.empty() || !get_master_enabled() || !is_on())
+    {
+      unlock();
+      break;
+    }
+    front= wait_queue.front();
     create_timeout(&abstime, NULL);
-  cond_timewait(&abstime);
-
-end:
+    wait_result= cond_timewait(front.thd, &abstime);
I'm wondering, is it safe here to wait on front.thd->COND_wakeup_ready?
Isn't there a possibility that this might go away during the wait?

A simple way to avoid this could be just to keep the original global
COND_binlog_send and wait for that here (and signal both that and
thd->COND_wakeup_ready when waking up). IIUC, this code will only run in a
single thread during shutdown, so probably fine to have some extra
unnecessary wakeups here.
...
diff --git a/sql/semisync_master.h b/sql/semisync_master.h
index 99f46869354..6de0333d4ce 100644
--- a/sql/semisync_master.h
+++ b/sql/semisync_master.h
@@ -21,6 +21,7 @@
...
+/*
+  Element in Repl_semi_sync_master::wait_queue to preserve the state of a
+  transaction waiting for an ACK.
+*/
+typedef struct _semisync_wait_trx {
+  const char *binlog_name;
+  my_off_t binlog_pos;
+  THD *thd;
+} semisync_wait_trx_t;
...
+  std::queue<semisync_wait_trx_t> wait_queue;
Let's see if we can avoid this, and use the existing m_active_tranxs queue
instead.


diff --git a/sql/semisync_master.cc b/sql/semisync_master.cc
index 8cc721e5737..0eaf0f0e0e2 100644
--- a/sql/semisync_master.cc
+++ b/sql/semisync_master.cc
@@ -979,6 +979,14 @@ int Repl_semi_sync_master::commit_trx(const char* trx_wait_binlog_name,
         {
           rpl_semi_sync_master_trx_wait_num++;
           rpl_semi_sync_master_trx_wait_time += wait_time;
+          /*
+            Assert we have either recieved our ACK; or have timed out and are
+            awoken in an off state.
+          */
+          DBUG_ASSERT(!get_master_enabled() || !is_on() || thd->is_killed() ||
+                      0 <= Active_tranx::compare(
+                               m_reply_file_name, m_reply_file_pos,
+                               trx_wait_binlog_name, trx_wait_binlog_pos));
         }
       }
     }

I think it's a bit much to crash the server here (even if only debug build),
because of the possibility of spurious wakeups.

It's good to have the test case though, maybe put the assertion inside
DBUG_EXECUTE_IF()?
I'd make it just log an error in the error log which the test case could
flag (to avoid crashing), but that's a matter of style I guess and up to
you.

 - Kristian.

    

[MariaDB developers] Re: [PATCH 2/2] MDEV-33551: Semi-sync Wait Point AFTER_COMMIT Slow on Workloads with Heavy Concurrency

Kristian Nielsen