Re: [Maria-developers] [Commits] 0f97f6b8398: MDEV-17346 parallel slave start and stop races to workers disappeared
Hi Andrei!

andrei.elkin@pp.inet.fi writes:
revision-id: 0f97f6b8398054ccb0507fbacc76c9deeddd47a4 (mariadb-10.1.35-71-g0f97f6b8398)
author: Andrei Elkin
timestamp: 2018-10-03 15:42:12 +0300
message:
MDEV-17346 parallel slave start and stop races to workers disappeared
Ooh, that's a nice catch. The code around slave start/stop really is terribly tricky, glad that you were able to sort this one out. The patch looks good to push. What about a test case? I assume this was caught from occasional random failures in buildbot? That should be ok as a test case for this bug, I think. It can be a _lot_ of work to write reliable tests for stuff like this, and slave start/stop is stressed quite a bit by the existing replication tests.
The bug appears as the slave SQL thread hanging in rpl_parallel_thread_pool::get_thread() while there are no slave worker threads left to wake it up.
The hang could occur when a newly started "new" SQL thread activated the parallel slave worker pool while the pool was concurrently being deactivated by a terminating "old" SQL thread. When reading the current pool size, the SQL thread did not take the protection introduced by MDEV-9573. The pool cannot be deactivated while there is an active slave, but the "new" slave might set its active status too late, while still seeing the pool as non-empty. The following sequence of four computational events
  Old_slave: any_slave_sql_running() => 0
  New_slave: slave_running= "active"
  New_slave: global_rpl_thread_pool.size > 0 => true
  Old_slave: global_rpl_thread_pool.size := 0
could lead to the observed hang. The new SQL thread proceeds to schedule events even though all the workers are gone.
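For illustration, here is a minimal standalone C++ model of that interleaving (all names are invented for the sketch; this is not MariaDB code). It forces the four events in the order above and shows the new slave deciding the pool is populated just before the old slave empties it:

  #include <atomic>
  #include <cstdio>
  #include <thread>

  static std::atomic<int>  pool_size{4};      // stands in for the pool's worker count
  static std::atomic<bool> slave_running{false};
  static std::atomic<int>  step{0};           // drives the exact interleaving

  static void wait_for(int s)
  {
    while (step.load() != s)
      std::this_thread::yield();
  }

  int main()
  {
    std::thread old_slave([] {
      // 1) Old_slave: any_slave_sql_running() => 0
      if (!slave_running.load())
        step.store(1);
      wait_for(2);
      // 4) Old_slave: global_rpl_thread_pool.size := 0
      pool_size.store(0);
      step.store(3);
    });

    std::thread new_slave([] {
      wait_for(1);
      // 2) New_slave: slave_running= "active" (too late for the old slave's check)
      slave_running.store(true);
      // 3) New_slave: pool size > 0 => true, so (re)activation is skipped
      bool looks_active= pool_size.load() > 0;
      step.store(2);
      wait_for(3);
      // At this point the real new slave would call get_thread() with no
      // workers left to wake it: the observed hang.
      std::printf("pool looked active: %d, workers left: %d\n",
                  (int)looks_active, pool_size.load());
    });

    old_slave.join();
    new_slave.join();
    return 0;
  }

The printout (1 and 0) is exactly the inconsistent view described above: the new slave trusted a pool size it read without synchronisation.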
Fixed by making the SQL thread, at pool activation, first grab the same lock that a potential deactivator also takes before destroying the pool.
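To make that concrete, here is a simplified sketch of what such a busy "lock" might look like (assumed shape with invented types; the real pool_mark_busy()/pool_mark_not_busy() in sql/rpl_parallel.cc also take a THD and can fail if the waiting thread is killed, which is omitted here):

  #include <condition_variable>
  #include <mutex>

  struct ThreadPool
  {
    std::mutex              mtx;
    std::condition_variable cond;
    bool                    busy= false;      // the "lock" both sides must take
    int                     count= 0;         // number of worker threads
  };

  // Block until no one else holds the pool busy, then take it.
  static void pool_mark_busy(ThreadPool *pool)
  {
    std::unique_lock<std::mutex> lk(pool->mtx);
    pool->cond.wait(lk, [pool] { return !pool->busy; });
    pool->busy= true;
  }

  static void pool_mark_not_busy(ThreadPool *pool)
  {
    {
      std::lock_guard<std::mutex> lk(pool->mtx);
      pool->busy= false;
    }
    pool->cond.notify_all();
  }

  int main()
  {
    ThreadPool pool;
    pool_mark_busy(&pool);                    // activation takes the flag first
    bool must_populate= (pool.count == 0);    // now safe to read the pool size
    pool_mark_not_busy(&pool);
    return must_populate ? 0 : 1;
  }

With activation taking the flag first, a deactivator that already holds it finishes emptying the pool before the activator can read the count; the activator then sees count == 0 and knows to re-populate the pool, which is the ordering the patch below relies on.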
At first I did not see how this fixes the problem. But it is because of
the extra check inside rpl_parallel_change_thread_count():

  if (!new_count && !force)
  {
    if (any_slave_sql_running())
    {
      pool_mark_not_busy(pool);
      return 0; // Ok to not resize pool
    }
  }

So one possibility is that New_slave has set running=active when Old_slave
hits that test. Then Old_slave will leave the pool intact thanks to this
test. The other possibility is that New_slave has not yet set
running=active at this point, so Old_slave can empty the pool. But then
Old_slave has already marked the pool busy. So the patch will ensure that
New_slave will see that the pool has been emptied, and will know to
re-populate it before starting.

Seems solid. Thanks for fixing!

 - Kristian.
---
 sql/rpl_parallel.cc | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
index 35cddee6d4d..8fef2d66635 100644
--- a/sql/rpl_parallel.cc
+++ b/sql/rpl_parallel.cc
@@ -1617,13 +1617,32 @@ int rpl_parallel_resize_pool_if_no_slaves(void)
 }
 
 
+/**
+  Pool activation is preceded by taking a "lock" of pool_mark_busy
+  which guarantees the number of running slaves drops to zero atomically
+  with the number of pool workers.
+  This resolves a race between the function caller thread and one
+  that may be attempting to deactivate the pool.
+*/
 int
 rpl_parallel_activate_pool(rpl_parallel_thread_pool *pool)
 {
+  int rc= 0;
+
+  if ((rc= pool_mark_busy(pool, current_thd)))
+    return rc;                                  // killed
+
   if (!pool->count)
-    return rpl_parallel_change_thread_count(pool, opt_slave_parallel_threads,
-                                            0);
-  return 0;
+  {
+    pool_mark_not_busy(pool);
+    rc= rpl_parallel_change_thread_count(pool, opt_slave_parallel_threads,
+                                         0);
+  }
+  else
+  {
+    pool_mark_not_busy(pool);
+  }
+  return rc;
 }