commit c51636f254703602f6f6e2e4a260e607e737b9c1 (origin/10.6-MDEV-33551) Author: Brandon Nesterenko <brandon.nesterenko@mariadb.com> Date: Mon Feb 26 07:01:47 2024 -0700
MDEV-33551: Semi-sync Wait Point AFTER_COMMIT Slow on Workloads with Heavy Concurrency
First some overall comments. The approach of waking up each thread on its own condition seems good to me. Monty had a different suggestion, which is to have only a single ack sent per group commit; this could also be a possibility, but it seems orthogonal to your fix (ie. both could make sense). Monty also had the idea to move the AFTER_COMMIT sync point into the group commit code, though that changes the semantics of AFTER_COMMIT somewhat, especially with --innodb-flush-log-at-trx-commit=3 (or non-InnoDB storage engine). So in this review I'll focus on keeping your overall approach the same. Your patch introduces a new wait_queue with the list of waiting threads. But there is already a very similar queue, m_active_tranxs. It looks to me as if the m_active_tranx could easily be extended so it can be used for this, and avoid introducing another queue. Your wait_queue is pushed in write_tranx_in_binlog() which already does m_active_tranxs->insert_tranx_node(), and it is popped in report_reply_binlog() which also does m_active_tranxs->clear_active_tranx_nodes(). So add some info to struct Tranx_node, maybe a THD * is all that is needed? A bit of care must be taken about lifetime of Tranx_node and THD respectively (in case one goes away but not the other). But the whole m_active_tranxs is already protected by a lock, and there is a hash for looking up specific position, so I think a THD that wants to abort its wait can simply clear the THD * from its Tranx_node or something like that to ensure the semisync ack will not try to wake up a THD that no longer exists. You'll need to see how this will affect the code for shutdown (await_all_slave_replies()) where it needs to wait for all acks (IIUC). But I hope we can be pragmatic here and find something simple that works. Another general thing related to this:
3) Repl_semi_sync_master::commit_trx() no longer loops to await its specific ACK. It waits once, and will either fail from timeout, or receive its ACK.
We still need the loop to wait. From `man pthread_cond_wait`: "When using condition variables there is always a Boolean predicate in‐ volving shared variables associated with each condition wait that is true if the thread should proceed. Spurious wakeups from the pthread_cond_timedwait() or pthread_cond_wait() functions may occur. Since the return from pthread_cond_timedwait() or pthread_cond_wait() does not imply anything about the value of this predicate, the predi‐ cate should be re-evaluated upon such return." This affects the code in the patch in some places, detailed comments below.
2) The time when thd::is_awaiting_semi_sync_ack is set is moved to at binlogging time, to ensure transactions which have been binlogged and queued up to await an ACK are not killed, and are still waited on.
I don't understand this. Why shouldn't a session waiting for ACK be killable? On the contrary, such a wait can take potentially a long time, seems important that it can be killed? I don't fully understand the existing code for this, maybe the patch is correct and I'm just looking for an explanation why it is correct like this. Or maybe this is a limitation of the existing code, and your patch doesn't change it (so having the ability to kill a waiting thread could be added in a different patch)?
diff --git a/mysql-test/suite/rpl/t/rpl_semi_sync_cond_var_per_thd.test b/mysql-test/suite/rpl/t/rpl_semi_sync_cond_var_per_thd.test index f8fa0a99d9c..ebe15a4ca32 100644 --- a/mysql-test/suite/rpl/t/rpl_semi_sync_cond_var_per_thd.test +++ b/mysql-test/suite/rpl/t/rpl_semi_sync_cond_var_per_thd.test @@ -22,6 +22,7 @@ --source include/master-slave.inc
--connection master +call mtr.add_suppression("Got an error reading communication packets");
Is this suppression needed? If so, why?
diff --git a/sql/mysqld.cc b/sql/mysqld.cc index e224871795e..b315edc091c 100644 --- a/sql/mysqld.cc +++ b/sql/mysqld.cc @@ -1750,18 +1750,12 @@ static void close_connections(void) /* If we are waiting on any ACKs, delay killing the thread until either an ACK is received or the timeout is hit.
- while (waiting_threads-- > 0) - repl_semisync_master.await_slave_reply(); + repl_semisync_master.await_all_slave_replies();
Just curious here, why did you change this code? Was it just because now there is no general COND_binlog_send to wait on? Or was it fix a bug not directly related to the main issue? I'm asking because if it makes things simpler, I think it's fine to keep the COND_binlog_send and signal it in addition to the individual thread-specific condition (in fact your patch doesn't seem to remove it though it looks as if it's no longer used). On the other hand, this old code in close_connections() that simply calls await_slave_reply() N times (N = sync_get_master_wait_sessions()) looks quite wrong to me, looping for a specific number of times makes no sense. So fixing it is good in any case.
diff --git a/sql/semisync_master.cc b/sql/semisync_master.cc index 0eaf0f0e0e2..f421e924a61 100644 --- a/sql/semisync_master.cc +++ b/sql/semisync_master.cc @@ -502,16 +502,33 @@ void Repl_semi_sync_master::unlock()
void Repl_semi_sync_master::cond_broadcast() { - mysql_cond_broadcast(&COND_binlog_send); + while (!wait_queue.empty()) + { + semisync_wait_trx_t next_waiter= wait_queue.front();
Maybe this function should now have a better name than cond_broadcast() (wakeup_ready_tranx() or something?). But this code will probably change in any case if the wait_queue is folded into m_active_tranxs.
-int Repl_semi_sync_master::cond_timewait(struct timespec *wait_time) +int Repl_semi_sync_master::cond_timewait(THD *thd, struct timespec *wait_time)
I think this function now makes little sense, clearer just to call directly mysql_cond_wait(thd->COND_wakeup_ready) at the caller, what do you think?
@@ -695,12 +712,36 @@ int Repl_semi_sync_master::report_reply_binlog(uint32 server_id,
+ lock(); + while (!wait_queue.empty()) + { + semisync_wait_trx_t next_waiter= wait_queue.front();
- cond_broadcast(); + cmp= Active_tranx::compare(m_reply_file_name, m_reply_file_pos, + next_waiter.binlog_name, next_waiter.binlog_pos); + if (cmp >= 0) + { + DBUG_PRINT("semisync", ("%s: signal thread %llu.", + "Repl_semi_sync_master::report_reply_binlog", + next_waiter.thd->thread_id)); + mysql_cond_signal(&next_waiter.thd->COND_wakeup_ready); + wait_queue.pop();
This code is very similar to Active_tranx::clear_active_tranx_nodes(), so hopefully the two can be integrated and wait_queue becomes unnecessary.
@@ -779,7 +821,20 @@ int Repl_semi_sync_master::report_binlog_update(THD* thd, const char *log_file, strcpy(log_info->log_file, log_file + dirname_length(log_file)); log_info->log_pos = log_pos;
- return write_tranx_in_binlog(log_info->log_file, log_pos); + /* + THD arg depends on wait point mode. If after storage engine commit, the + individual connection threads will perform the wait for semi-sync ACKt, + thd is the thread of the user connection thread. + If it is after binlog sync, the binlog leader thread will perform the + semi-sync waits on behalf of the grouped transaction (which at this + point, we (current_thd) are the leader). If using binlog_group_commit, + thd is the thread of the user connection thread. + */ + thd_to_cond_wait= ((wait_point() == SEMI_SYNC_MASTER_WAIT_POINT_AFTER_STORAGE_COMMIT) + ? thd + : current_thd);
This condition is complex. It's good that you added the detailed comment explaining what's going on, but I think it would be better to avoid it completely. Why not just pass in the value of thd_to_cond_wait explicitly from the caller? As an aside, we have this code in trx_group_commit_leader(): for (current= queue; current != NULL; current= current->next) { current->error= repl_semisync_master.wait_after_sync(current->cache_mngr-> last_commit_pos_file, current->cache_mngr-> last_commit_pos_offset); It seems silly to wait for every binlog position to be acked her one by one, why not just wait for the last one? repl_semisync_master.wait_after_sync(last_in_queue->cache_mngr->last_commit_pos_file, ...); But while related, it's perhaps not directly part of the problem you're addressing in your patch, so up to you if you want to include this or leave it for a possible later patch.
@@ -852,12 +907,11 @@ int Repl_semi_sync_master::commit_trx(const char* trx_wait_binlog_name,
- if (!get_master_enabled() || !is_on()) + if (!get_master_enabled() || !is_on() || thd_killed(thd)) goto l_end;
- while (is_on() && !thd_killed(thd)) - { - /* We have to check these again as things may have changed */ - if (!rpl_semi_sync_master_clients && !rpl_semi_sync_master_wait_no_slave) - { - aborted= 1; - break; - }
As described earlier, because of the (at least theoretical) possibility of spurious wakeup, we should keep the while loop here and the checks.
@@ -1253,6 +1294,8 @@ int Repl_semi_sync_master::write_tranx_in_binlog(const char* log_file_name, } else { + wait_queue.push({log_file_name, log_file_pos, thd});
Hopefully this can be instead folded into m_active_tranxs->insert_tranx_node() which is called just a bit below.
-void Repl_semi_sync_master::await_slave_reply() +void Repl_semi_sync_master::await_all_slave_replies()
+ while (TRUE) + { + lock(); + if (wait_queue.empty() || !get_master_enabled() || !is_on()) + { + unlock(); + break; + } + front= wait_queue.front(); create_timeout(&abstime, NULL); - cond_timewait(&abstime); - -end: + wait_result= cond_timewait(front.thd, &abstime);
I'm wondering, is it safe here to wait on front.thd->COND_wakeup_ready? Isn't there a possibility that this might go away during the wait? A simple way to avoid this could be just to keep the original global COND_binlog_send and wait for that here (and signal both that and thd->COND_wakeup_ready when waking up). IIUC, this code will only run in a single thread during shutdown, so probably fine to have some extra unnecessary wakeups here.
diff --git a/sql/semisync_master.h b/sql/semisync_master.h index 99f46869354..6de0333d4ce 100644 --- a/sql/semisync_master.h +++ b/sql/semisync_master.h @@ -21,6 +21,7 @@
+/* + Element in Repl_semi_sync_master::wait_queue to preserve the state of a + transaction waiting for an ACK. +*/ +typedef struct _semisync_wait_trx { + const char *binlog_name; + my_off_t binlog_pos; + THD *thd; +} semisync_wait_trx_t;
+ std::queue<semisync_wait_trx_t> wait_queue;
Let's see if we can avoid this, and use the existing m_active_tranxs queue instead. diff --git a/sql/semisync_master.cc b/sql/semisync_master.cc index 8cc721e5737..0eaf0f0e0e2 100644 --- a/sql/semisync_master.cc +++ b/sql/semisync_master.cc @@ -979,6 +979,14 @@ int Repl_semi_sync_master::commit_trx(const char* trx_wait_binlog_name, { rpl_semi_sync_master_trx_wait_num++; rpl_semi_sync_master_trx_wait_time += wait_time; + /* + Assert we have either recieved our ACK; or have timed out and are + awoken in an off state. + */ + DBUG_ASSERT(!get_master_enabled() || !is_on() || thd->is_killed() || + 0 <= Active_tranx::compare( + m_reply_file_name, m_reply_file_pos, + trx_wait_binlog_name, trx_wait_binlog_pos)); } } } I think it's a bit much to crash the server here (even if only debug build), because of the possibility of spurious wakeups. It's good to have the test case though, maybe put the assertion inside DBUG_EXECUTE_IF()? I'd make it just log an error in the error log which the test case could flag (to avoid crashing), but that's a matter of style I guess and up to you. - Kristian.