Thanks for the review, Kristian! See my replies below.

On Tue, Mar 5, 2024 at 6:34 AM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:

> commit c51636f254703602f6f6e2e4a260e607e737b9c1 (origin/10.6-MDEV-33551)
> Author: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
> Date: Mon Feb 26 07:01:47 2024 -0700
>
> MDEV-33551: Semi-sync Wait Point AFTER_COMMIT Slow on Workloads with Heavy Concurrency

First some overall comments.

The approach of waking up each thread on its own condition seems good to me.
Monty had a different suggestion, which is to have only a single ack sent
per group commit; this could also be a possibility, but it seems orthogonal
to your fix (ie. both could make sense). Monty also had the idea to move the
AFTER_COMMIT sync point into the group commit code, though that changes the
semantics of AFTER_COMMIT somewhat, especially with
--innodb-flush-log-at-trx-commit=3 (or non-InnoDB storage engine). So in
this review I'll focus on keeping your overall approach the same.

Right. The group ACK additions are tracked by MDEV-33491 so we don't lose

the idea.

Your patch introduces a new wait_queue with the list of waiting threads. But
there is already a very similar queue, m_active_tranxs. It looks to me as if
the m_active_tranx could easily be extended so it can be used for this, and
avoid introducing another queue. Your wait_queue is pushed in
write_tranx_in_binlog() which already does
m_active_tranxs->insert_tranx_node(), and it is popped in
report_reply_binlog() which also does
m_active_tranxs->clear_active_tranx_nodes().

So add some info to struct Tranx_node, maybe a THD * is all that is needed?
A bit of care must be taken about lifetime of Tranx_node and THD
respectively (in case one goes away but not the other). But the whole
m_active_tranxs is already protected by a lock, and there is a hash for
looking up specific position, so I think a THD that wants to abort its wait
can simply clear the THD * from its Tranx_node or something like that to
ensure the semisync ack will not try to wake up a THD that no longer exists.

This is a great connection; I'll definitely incorporate this, thanks! (Note that your

individual code comments to this point I'll skip replying to).

You'll need to see how this will affect the code for shutdown
(await_all_slave_replies()) where it needs to wait for all acks (IIUC). But
I hope we can be pragmatic here and find something simple that works.

It should be a simple change to accommodate it.

Another general thing related to this:

> 3) Repl_semi_sync_master::commit_trx() no longer loops to await
> its specific ACK. It waits once, and will either fail from
> timeout, or receive its ACK.

We still need the loop to wait. From `man pthread_cond_wait`:

"When using condition variables there is always a Boolean predicate in‐
volving shared variables associated with each condition wait that is
true if the thread should proceed. Spurious wakeups from the
pthread_cond_timedwait() or pthread_cond_wait() functions may occur.
Since the return from pthread_cond_timedwait() or pthread_cond_wait()
does not imply anything about the value of this predicate, the predi‐
cate should be re-evaluated upon such return."

Good catch.

This affects the code in the patch in some places, detailed comments below.

> 2) The time when thd::is_awaiting_semi_sync_ack is set is moved
> to at binlogging time, to ensure transactions which have been
> binlogged and queued up to await an ACK are not killed,
> and are still waited on.

I don't understand this. Why shouldn't a session waiting for ACK be
killable? On the contrary, such a wait can take potentially a long time,
seems important that it can be killed?

I don't fully understand the existing code for this, maybe the patch is
correct and I'm just looking for an explanation why it is correct like this.
Or maybe this is a limitation of the existing code, and your patch doesn't
change it (so having the ability to kill a waiting thread could be added in
a different patch)?

thd->is_awaiting_semi_sync_ack is only used when running

SHUTDOWN WAIT FOR ALL SLAVES

which has the clause:

"When the WAIT FOR ALL REPLICAS option is provided, the server only

kills its binary log dump threads after all client threads have been killed,

and it only completes the shutdown after the last binary log has been sent

to all connected replicas."

So the idea with my fix is that, once we have noted the update to the binary

log, we immediately mark in thd->is_awaiting_semi_sync_ack, so the

shutdown will await its ACK.

> diff --git a/mysql-test/suite/rpl/t/rpl_semi_sync_cond_var_per_thd.test b/mysql-test/suite/rpl/t/rpl_semi_sync_cond_var_per_thd.test
> index f8fa0a99d9c..ebe15a4ca32 100644
> --- a/mysql-test/suite/rpl/t/rpl_semi_sync_cond_var_per_thd.test
> +++ b/mysql-test/suite/rpl/t/rpl_semi_sync_cond_var_per_thd.test
> @@ -22,6 +22,7 @@
> --source include/master-slave.inc
>
> --connection master
> +call mtr.add_suppression("Got an error reading communication packets");

Is this suppression needed? If so, why?

If I remember right, when the slave io thread stops, the master's ACK receiver can

log that error when trying to read from the socket. Though note this is about to change

with Monty's upcoming MDEV-33582 patch, so it'll likely have to be amended shortly

anyway.

> diff --git a/sql/mysqld.cc b/sql/mysqld.cc
> index e224871795e..b315edc091c 100644
> --- a/sql/mysqld.cc
> +++ b/sql/mysqld.cc
> @@ -1750,18 +1750,12 @@ static void close_connections(void)
> /*
> If we are waiting on any ACKs, delay killing the thread until either an ACK
> is received or the timeout is hit.

> - while (waiting_threads-- > 0)
> - repl_semisync_master.await_slave_reply();
> + repl_semisync_master.await_all_slave_replies();

Just curious here, why did you change this code? Was it just because now
there is no general COND_binlog_send to wait on? Or was it fix a bug not
directly related to the main issue?

I'm asking because if it makes things simpler, I think it's fine to keep the
COND_binlog_send and signal it in addition to the individual thread-specific
condition (in fact your patch doesn't seem to remove it though it looks as
if it's no longer used).

On the other hand, this old code in close_connections() that simply calls
await_slave_reply() N times (N = sync_get_master_wait_sessions()) looks
quite wrong to me, looping for a specific number of times makes no sense. So
fixing it is good in any case.

It was changed because the old code would wait on each individual ACK to

be received from the slave (but with sporadic wake-ups it was definitely broken,

because it manually tried to count the number of waiting threads with each

ACK iteration removing a thd). So keeping it will fix this issue, as well as wait

on each specific transaction's ACK (rather than the general case).

> diff --git a/sql/semisync_master.cc b/sql/semisync_master.cc
> index 0eaf0f0e0e2..f421e924a61 100644
> --- a/sql/semisync_master.cc
> +++ b/sql/semisync_master.cc
> @@ -502,16 +502,33 @@ void Repl_semi_sync_master::unlock()
>
> void Repl_semi_sync_master::cond_broadcast()
> {
> - mysql_cond_broadcast(&COND_binlog_send);
> + while (!wait_queue.empty())
> + {
> + semisync_wait_trx_t next_waiter= wait_queue.front();

Maybe this function should now have a better name than cond_broadcast()
(wakeup_ready_tranx() or something?). But this code will probably change in
any case if the wait_queue is folded into m_active_tranxs.

I had originally thought about it, and kept it with the rationale that it kept the

semantics of broadcast - i.e., it would wake-up / remove _all_ waiting threads

(because it is only used when turning off semi-sync). But now on second thought,

because the transactions use different cond vars, that is probably misleading

to keep broadcast, so I'll change it.

> -int Repl_semi_sync_master::cond_timewait(struct timespec *wait_time)
> +int Repl_semi_sync_master::cond_timewait(THD *thd, struct timespec *wait_time)

I think this function now makes little sense, clearer just to call directly
mysql_cond_wait(thd->COND_wakeup_ready) at the caller, what do you think?

Sure, that'd be easier to keep track of.

> @@ -695,12 +712,36 @@ int Repl_semi_sync_master::report_reply_binlog(uint32 server_id,

> + lock();
> + while (!wait_queue.empty())
> + {
> + semisync_wait_trx_t next_waiter= wait_queue.front();
>
> - cond_broadcast();
> + cmp= Active_tranx::compare(m_reply_file_name, m_reply_file_pos,
> + next_waiter.binlog_name, next_waiter.binlog_pos);
> + if (cmp >= 0)
> + {
> + DBUG_PRINT("semisync", ("%s: signal thread %llu.",
> + "Repl_semi_sync_master::report_reply_binlog",
> + next_waiter.thd->thread_id));
> + mysql_cond_signal(&next_waiter.thd->COND_wakeup_ready);
> + wait_queue.pop();

This code is very similar to Active_tranx::clear_active_tranx_nodes(), so
hopefully the two can be integrated and wait_queue becomes unnecessary.

> @@ -779,7 +821,20 @@ int Repl_semi_sync_master::report_binlog_update(THD* thd, const char *log_file,
> strcpy(log_info->log_file, log_file + dirname_length(log_file));
> log_info->log_pos = log_pos;
>
> - return write_tranx_in_binlog(log_info->log_file, log_pos);
> + /*
> + THD arg depends on wait point mode. If after storage engine commit, the
> + individual connection threads will perform the wait for semi-sync ACKt,
> + thd is the thread of the user connection thread.
> + If it is after binlog sync, the binlog leader thread will perform the
> + semi-sync waits on behalf of the grouped transaction (which at this
> + point, we (current_thd) are the leader). If using binlog_group_commit,
> + thd is the thread of the user connection thread.
> + */
> + thd_to_cond_wait= ((wait_point() == SEMI_SYNC_MASTER_WAIT_POINT_AFTER_STORAGE_COMMIT)
> + ? thd
> + : current_thd);

This condition is complex. It's good that you added the detailed comment
explaining what's going on, but I think it would be better to avoid it
completely. Why not just pass in the value of thd_to_cond_wait explicitly
from the caller?

I'd gone back and forth on that, as it also seemed confusing to have to

pass in two threads into the function. But I suppose either way the confusion

exists, and having it as an argument removes one layer of magic. So sure,

I'll do that.

As an aside, we have this code in trx_group_commit_leader():

for (current= queue; current != NULL; current= current->next) {
current->error=
repl_semisync_master.wait_after_sync(current->cache_mngr->
last_commit_pos_file,
current->cache_mngr->
last_commit_pos_offset);

It seems silly to wait for every binlog position to be acked her one by one,
why not just wait for the last one?

repl_semisync_master.wait_after_sync(last_in_queue->cache_mngr->last_commit_pos_file, ...);

But while related, it's perhaps not directly part of the problem you're
addressing in your patch, so up to you if you want to include this or leave
it for a possible later patch.

Right, I'd like to address it in MDEV-33491. (I discuss a little bit on why in the

last paragraph of the description on the JIRA itself).

> @@ -852,12 +907,11 @@ int Repl_semi_sync_master::commit_trx(const char* trx_wait_binlog_name,

> - if (!get_master_enabled() || !is_on())
> + if (!get_master_enabled() || !is_on() || thd_killed(thd))
> goto l_end;

> - while (is_on() && !thd_killed(thd))
> - {
> - /* We have to check these again as things may have changed */
> - if (!rpl_semi_sync_master_clients && !rpl_semi_sync_master_wait_no_slave)
> - {
> - aborted= 1;
> - break;
> - }

As described earlier, because of the (at least theoretical) possibility of
spurious wakeup, we should keep the while loop here and the checks.

> @@ -1253,6 +1294,8 @@ int Repl_semi_sync_master::write_tranx_in_binlog(const char* log_file_name,
> }
> else
> {
> + wait_queue.push({log_file_name, log_file_pos, thd});

Hopefully this can be instead folded into
m_active_tranxs->insert_tranx_node() which is called just a bit below.

> -void Repl_semi_sync_master::await_slave_reply()
> +void Repl_semi_sync_master::await_all_slave_replies()

> + while (TRUE)
> + {
> + lock();
> + if (wait_queue.empty() || !get_master_enabled() || !is_on())
> + {
> + unlock();
> + break;
> + }
> + front= wait_queue.front();
> create_timeout(&abstime, NULL);
> - cond_timewait(&abstime);
> -
> -end:
> + wait_result= cond_timewait(front.thd, &abstime);

I'm wondering, is it safe here to wait on front.thd->COND_wakeup_ready?
Isn't there a possibility that this might go away during the wait?

A simple way to avoid this could be just to keep the original global
COND_binlog_send and wait for that here (and signal both that and
thd->COND_wakeup_ready when waking up). IIUC, this code will only run in a
single thread during shutdown, so probably fine to have some extra
unnecessary wakeups here.

I thought about that originally, and I don't think it can go away, as we

have a lock on changes in the wait_queue (to be Active_tranx), and for

the underlying thread to progress past its wait, it cannot be in the queue

any longer. It would be either removed by the Ack_receiver when signalling

to move on, or by any thread which times out/is killed and removes it from

the queue. A good assertion to add though. Perhaps just to assert

thd->is_awaiting_semisync_ack.

> diff --git a/sql/semisync_master.h b/sql/semisync_master.h
> index 99f46869354..6de0333d4ce 100644
> --- a/sql/semisync_master.h
> +++ b/sql/semisync_master.h
> @@ -21,6 +21,7 @@

> +/*
> + Element in Repl_semi_sync_master::wait_queue to preserve the state of a
> + transaction waiting for an ACK.
> +*/
> +typedef struct _semisync_wait_trx {
> + const char *binlog_name;
> + my_off_t binlog_pos;
> + THD *thd;
> +} semisync_wait_trx_t;

> + std::queue<semisync_wait_trx_t> wait_queue;

Let's see if we can avoid this, and use the existing m_active_tranxs queue
instead.

diff --git a/sql/semisync_master.cc b/sql/semisync_master.cc
index 8cc721e5737..0eaf0f0e0e2 100644
--- a/sql/semisync_master.cc
+++ b/sql/semisync_master.cc
@@ -979,6 +979,14 @@ int Repl_semi_sync_master::commit_trx(const char* trx_wait_binlog_name,
{
rpl_semi_sync_master_trx_wait_num++;
rpl_semi_sync_master_trx_wait_time += wait_time;
+ /*
+ Assert we have either recieved our ACK; or have timed out and are
+ awoken in an off state.
+ */
+ DBUG_ASSERT(!get_master_enabled() || !is_on() || thd->is_killed() ||
+ 0 <= Active_tranx::compare(
+ m_reply_file_name, m_reply_file_pos,
+ trx_wait_binlog_name, trx_wait_binlog_pos));
}
}
}

I think it's a bit much to crash the server here (even if only debug build),
because of the possibility of spurious wakeups.

It's good to have the test case though, maybe put the assertion inside
DBUG_EXECUTE_IF()?
I'd make it just log an error in the error log which the test case could
flag (to avoid crashing), but that's a matter of style I guess and up to
you.

Agreed.

- Kristian.

In summary (mostly for my own ease of reference):

1. Use the existing list maintained by Active_tranx instead of the new

wait_queue

2. Put back the loop for cond_var, as the waiter can spuriously awake

3. Rename cond_broadcast appropriately

4. Remove cond_timewait in favor of directly calling mysql_cond_wait

5. Move thd_to_cond_wait to be a parameter, rather than a magic

condition

6. Assert thd->is_awaiting_semisync_ack in await_all_slave_replies()

7. Change DBUG_ASSERT to a DBUG_EXECUTE_IF error log

message when a transaction is awoken.

8. Address group ACK logic with MDEV-33491.

Brandon