Thanks for the review, Kristian!

I agree with your summary, see my replies to your specific points below.

On Tue, Mar 12, 2024 at 5:29 AM Kristian Nielsen <knielsen@knielsen-hq.org> wrote:
> From: Brandon Nesterenko <brandon.nesterenko@mariadb.com>
>
> When using semi-sync replication with
> rpl_semi_sync_master_wait_point=AFTER_COMMIT, the performance of the
> primary can significantly reduce compared to AFTER_SYNC's
> performance for workloads with many concurrent users executing
> transactions. This is because all connections on the primary share

Hi Brandon,

Here is my second round review of the MDEV-33551 patch.

I think these comments of mine from first round still need to be addressed:

> > A bit of care must be taken about lifetime of Tranx_node and THD
> > respectively (in case one goes away but not the other). But the whole

> > > +void Repl_semi_sync_master::await_all_slave_replies()

> > > +    wait_result= cond_timewait(front.thd, &abstime);
> >
> > I'm wondering, is it safe here to wait on front.thd->COND_wakeup_ready?
> > Isn't there a possibility that this might go away during the wait?

I tried a simple testcase (attached) which disconnects the session thread
while await_all_slave_replies() is running during shutdown. As I expected, I
could see the COND_wakeup_ready of the session thread being de-allocated
while await_all_slave_replies() is still waking up from that condition. That
cannot be correct.

Similarly, it seems clearly possible that a session thread is killed and
exits while its entry is still present in the m_active_tranxs list, which
will similarly lead to signalling a de-allocated COND_wakeup_ready.

When a thread aborts the wait before the entry is removed, I think it should
simply look up its entry (can use the m_trx_htb hash table, or even just
iterate the list) and set its THD to NULL (or remove the entry, whatever is
simpler).

For the await_all_slave_replies() problem, I think the whole
wait-at-shutdown is much too complex as it is now. The whole
is_thd_awaiting_semisync_ack(), n_threads_awaiting_ack, special-case not
killing threads waiting for semi-sync-ack, and now introducing
unmark_thd_awaiting_ack. And it still has the bug waiting on a
COND_wakeup_ready of another THD that can disappear during the wait.

I think all of that should just go away, the wait in
await_all_slave_replies() should be done much simpler with a dedicated
condition variable to wait for. Eg. await_all_slave_replies() just does a
mysql_cond_wait(COND_binlog_send) under LOCK_binlog until the last binlog
position is ack'ed, and this condition is then signalled from
report_reply_binlog() (either always, or await_all_slave_replies() can set a
flag to request signal). Or have the shutdown thread enqueue its THD at the
end of the m_active_tranxs list. Let's avoid all this complex interaction
between SHUTDOWN WAIT FOR ALL SLAVES and session threads.

BTW, COND_binlog_send is currently unused in your patch.

Below detailed comments on individual parts of the patch:

> --- a/sql/mysqld.cc
> +++ b/sql/mysqld.cc
> @@ -1750,18 +1750,12 @@ static void close_connections(void)

> -  if (shutdown_wait_for_slaves && repl_semisync_master.get_master_enabled())
> +  if (shutdown_wait_for_slaves && repl_semisync_master.get_master_enabled() &&
> +      repl_semisync_master.sync_get_master_wait_sessions())

Do you need to check sync_get_master_wait_sessions() > 0 here? Why not wait
unconditionally for the last binlog position sent?

Ah this was an original simplification before I changed where set_awaiting_semisync_ack
was set, but leaving it so is a bug, as we can have items in the wait list, yet not be
waiting on the ACK. Even with the fix, I wonder if we should also move where we increase
the wait_sessions variable to be when we enqueue into Active_tranx (currently it is in
commit_trx() just before the wait). I suppose it depends on how we want to define a wait
session. I.e., when we implement group ACK, how do we want to report the leader actually
waiting for ACK, vs non-leaders waiting on the leader's ACK.
 

> --- a/sql/semisync_master.cc
> +++ b/sql/semisync_master.cc
> @@ -68,6 +68,14 @@ static ulonglong timespec_to_usec(const struct timespec *ts)
>    return (ulonglong) ts->tv_sec * TIME_MILLION + ts->tv_nsec / TIME_THOUSAND;
>  }

> +int signal_waiting_transaction(THD *waiting_thd, const char *binlog_file,
> +                                my_off_t binlog_pos)
> +{
> +  if (waiting_thd->is_awaiting_semisync_ack)
> +    mysql_cond_signal(&waiting_thd->COND_wakeup_ready);

The check on is_awaiting_semi_sync_ack seems wrong here. Normally, the THD
will be waiting, otherwise it would not be in the list in the first place,
so faster with an occasional redundant signal than a conditional always. And
if the THD is not waiting, it's not safe to access it anyway.

Instead, as mentioned above, have a check for thd != NULL here, and let a
non-waiting thread clear the THD in its entry before it releases the
LOCK_binlog.

Ah, that makes more sense. Will do.
 

> -int Repl_semi_sync_master::cond_timewait(struct timespec *wait_time)
> +int Repl_semi_sync_master::cond_timewait(THD *thd, struct timespec *_wait_time)

> +  if (_wait_time == NULL)
> +  {
> +    create_timeout(&gen_wait_time, NULL);
> +    real_wait_time= &gen_wait_time;
> +  }
> +  else
> +  {
> +    real_wait_time= _wait_time;
> +  }

Don't pass in booleans that are always hard-coded true or false in the
caller. Instead create two different functions, one that takes a time_spec,
and one that does not (the latter creating it and calling the former).

Sure.
 

> @@ -533,7 +559,8 @@ void Repl_semi_sync_master::remove_slave()
>        Signal transactions waiting in commit_trx() that they do not have to
>        wait anymore.
>      */
> -    cond_broadcast();
> +    m_active_tranxs->clear_active_tranx_nodes(NULL, NULL,
> +                                              signal_waiting_transaction);

This does not compile on my machine (the second NULL is passed to my_off_t
position), is that a last-minute change that was never tested? Or NULL is
defined differently on your machine, so it doesn't fail to compile?


Ah, interesting. With clang it compiles, but with GCC it does not. I'll fix it to `0`. 
 
BTW, question about the old code: Why didn't the old code clear the
m_active_tranxs list? Was it a bug? Or was it cleared elsewhere before (and
if so where)?

It did not clear it, but should have.
 

> -void Repl_semi_sync_master::await_slave_reply()
> +void Repl_semi_sync_master::await_all_slave_replies()

> +    The lock is taken per iteration rather than over the whole loop to allow
> +    more opportunity for the user threads to handle their ACKs.

That sounds like a mis-understanding. The lock is always released by
mysql_cond_timedwait() in each iteration of the loop, no need to
release/take it again for just a couple instructions. 

> +  while (TRUE)
> +  {
> +    lock();
> +    if (m_active_tranxs->is_empty() || !get_master_enabled() || !is_on())
> +    {
> +      unlock();
> +      break;
> +    }
> +    wait_result= m_active_tranxs->for_front(
> +        [](THD *thd, const char *binlog_file, my_off_t binlog_pos) -> int {
> +          return (thd->is_awaiting_semisync_ack)
> +                     ? repl_semisync_master.cond_timewait(thd, NULL)
> +                     : 0;
> +        });

As I wrote at the top, it is not safe to wait on the COND_wakeup_ready of
another THD, since that THD might go away during the wait. Instead, let's do
a specific wait for the end position here, either on COND_binlog or by
enqueing an extra THD at the end.

Sure, I'll keep COND_binlog around and use that (the main thread doesn't have
a THD, and creating one just to enqueue into the list would be more overhead
than keeping COND_binlog).
 

(And btw, in your code, why wait for the _front_ of the list and not the
_rear_?)

Ah, that was just keeping the original logic from pre-patch (to be somewhat
consistent). It would have made sense to optimize to just be the rear, with the new
logic.
 

> diff --git a/sql/semisync_master.h b/sql/semisync_master.h
> index 99f46869354..1eb690b572b 100644
> --- a/sql/semisync_master.h
> +++ b/sql/semisync_master.h
>
> +   * The pre_delete_hook parameter is a function pointer that will be invoked
> +   * for each Active_tranx node, in order, from m_trx_front to log_file_name,
> +   * e.g. to signal their wakeup condition. Repl_semi_sync_binlog::LOCK_binlog
> +   * is held while this is invoked.

"from m_trx_front to log_file_name", I don't understand?
(Typo s/log_file_name/m_trx_rear/ ?)


Huh. Big typo, my intelli-"sense" must have filled it in.
 
> +/*
> +  Element in Repl_semi_sync_master::wait_queue to preserve the state of a
> +  transaction waiting for an ACK.
> +*/
> +typedef struct _semisync_wait_trx {
> +  const char *binlog_name;
> +  my_off_t binlog_pos;
> +  THD *thd;
> +} semisync_wait_trx_t;
> +

This is not longer used (should be removed).

Right, thanks.
 

>     * Input:  (the transaction events' ending binlog position)
>     *  trx_wait_binlog_name -     (IN)  ending position's file name
>     *  trx_wait_binlog_pos  -     (IN)  ending position's file offset
> +   *  unmark_thd_awaiting_ack  - (IN)  Whether or not to unmark
> +   *                             thd->is_awaiting_semisync_ack after receiving
> +   *                             an ACK from the replica. If using wait_point
> +   *                             AFTER_SYNC with binlog_group_commit, only the
> +   *                             last transaction written to binlog in the
> +   *                             group should negate the boolean, because the
> +   *                             same thread (i.e. leader thread) will wait for
> +   *                             all transaction ACKs. Negating it too early
> +   *                             would break SHUTDOWN WAIT FOR ALL SLAVES, as
> +   *                             that is the condition tested to bypass killing
> +   *                             threads in phase 1. In all other situations,
> +   *                             the boolean should be negated immediately.

That seems unnecessarily convoluted.
I think the real problem here is that write_tranx_in_binlog() sets the
is_awaiting_semi_sync_ack flag on the wrong THD, on the waiter_thd instead
of the trx_thd.

But once we fix the await_all_slave_replies(), do we even need the
is_awaiting_semi_sync_ack flag any more? If so, why? If not, then this point
becomes moot anyway.

Ah, right, this is actually a bug as is, as when we are in kill_threads_phase_1 we
test for is_awaiting_semi_sync_ack, which would only be set on the leader
binlog thread.. So then in kill_threads_phase_1, we'd just check if the THD is in
Active_tranx to see if we should skip the kill. 
 

 - Kristian.


In summary
 1) Fix close_connections() check from wait_sessions to instead, always wait for
     the last element in the queue
 2) Fix aborted threads to set their THD to null (with respective NULL thd checks,
     where appropriate)
 3) Fix await_all_slave_replies() to use COND_binlog (signalled on last slave
     reply)
 4) Remove THD::is_awaiting_semisync_ack, change kill_threads_phase_1 to check
     if THD is in Active_tranx.
 5) Fix comment typo s/log_file_name/m_trx_rear/
 6) Remove struct _semisync_wait_trx
 7) Split cond_timewait into two functions, one with a time_spec param, and one
     without
 8) Fix clear_active_tranx_nodes NULL param 


Thanks again,
Brandon