Re: [Maria-developers] 4b164f176e6: MDEV-25114 Crash: WSREP: invalid state ROLLED_BACK (FATAL)
Hi, Seppo, Jan! Note, this is 10.2 patch below.
commit 4b164f176e6 Author: Seppo Jaakola <seppo.jaakola@galeracluster.com> Date: Wed Sep 15 09:16:44 2021 +0300
MDEV-25114 Crash: WSREP: invalid state ROLLED_BACK (FATAL)
I think this should say MDEV-23328 Server hang due to Galera lock conflict resolution it'd be easier to understand the commit comment and code changes if it'll refer to the correct problem it is fixing. The fix for MDEV-25114 was the previous commit, the one that This reverts commit 29bbcac0ee841faaa68eeb09c86ff825eabbe6b6 it should have MDEV-25114 in the commit comment.
This patch is the plan D variant for fixing potetial mutex locking
"potential"
order exercised by BF aborting and KILL command execution.
In this approach, KILL command is replicated as TOI operation. This guarantees total isolation for the KILL command execution in the first node: there is no concurrent replication applying and no concurrent DDL executing. Therefore there is no risk of BF aborting to happen in parallel with KILL command execution either. Potential mutex deadlocks between the different mutex access paths with KILL command execution and BF aborting cannot therefore happen.
TOI replication is used, in this approach, purely as means to provide isolated KILL command execution in the first node. KILL command should not (and must not) be applied in secondary nodes. In this patch, we make this sure by skipping KILL execution in secondary nodes, in applying phase, where we bail out if applier thread is trying to execute KILL command. This is effective, but skipping the applying of KILL command could happen much earlier as well.
I like the idea of using TOI apparatus to provide total isolation for KILL. Actually replicating it to nodes is kind of crude, it'd be more elegant not to replicate at all. But it'll do.
This patch also fixes mutex locking order and unprotected THD member accesses on bf aborting case. We try to hold THD::LOCK_thd_data during bf aborting. Only case where it is not possible is at wsrep_abort_transaction before call wsrep_innobase_kill_one_trx where we take InnoDB mutexes first and then THD::LOCK_thd_data.
This will also fix possible race condition during close_connection and while wsrep is disconnecting connections.
Added wsrep_bf_kill_debug test case
Reviewed-by: Jan Lindström <jan.lindstrom@mariadb.com>
diff --git a/sql/sql_class.cc b/sql/sql_class.cc index 5ada018e540..05ec9a1c369 100644 --- a/sql/sql_class.cc +++ b/sql/sql_class.cc @@ -1804,11 +1804,11 @@ void THD::awake(killed_state state_to_set) the Vio might be disassociated concurrently. */
-void THD::disconnect() +void THD::disconnect_mutexed()
should be disconnect_no_mutex, cf. awake_no_mutex, set_killed_no_mutex.
{ Vio *vio= NULL;
- mysql_mutex_lock(&LOCK_thd_data); + mysql_mutex_assert_owner(&LOCK_thd_data);
set_killed(KILL_CONNECTION);
diff --git a/sql/sql_parse.cc b/sql/sql_parse.cc index 3e1f248b082..2bec9c6b6cd 100644 --- a/sql/sql_parse.cc +++ b/sql/sql_parse.cc @@ -9069,6 +9069,18 @@ static void sql_kill(THD *thd, longlong id, killed_state state, killed_type type) { uint error; +#ifdef WITH_WSREP + if (WSREP(thd)) + { + WSREP_DEBUG("sql_kill called"); + if (thd->wsrep_applier) + { + WSREP_DEBUG("KILL in applying, bailing out here"); + return; + } + WSREP_TO_ISOLATION_BEGIN(WSREP_MYSQL_DB, NULL, NULL)
when can it fail and jump to wsrep_error_label? under what conditions?
+ } +#endif /* WITH_WSREP */ if (!(error= kill_one_thread(thd, id, state, type))) { if (!thd->killed) diff --git a/sql/wsrep_mysqld.cc b/sql/wsrep_mysqld.cc index e60100e2e90..00a6dfe2f8a 100644 --- a/sql/wsrep_mysqld.cc +++ b/sql/wsrep_mysqld.cc @@ -835,13 +835,25 @@ void wsrep_thr_init() DBUG_VOID_RETURN; }
+/* This is wrapper for wsrep_break_lock in thr_lock.c */ +static int wsrep_thr_abort_thd(void *bf_thd_ptr, void *victim_thd_ptr, my_bool signal) +{ + THD* victim_thd= (THD *) victim_thd_ptr; + /* We need to lock THD::LOCK_thd_data to protect victim + from concurrent usage or disconnect or delete. */
How do you know victim_thd wasn't deleted before you locked LOCK_thd_data below?
+ mysql_mutex_lock(&victim_thd->LOCK_thd_data); + int res= wsrep_abort_thd(bf_thd_ptr, victim_thd_ptr, signal); + mysql_mutex_unlock(&victim_thd->LOCK_thd_data); + return res; +}
diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc index 1f666f19d8a..921b8282e45 100644 --- a/storage/innobase/handler/ha_innodb.cc +++ b/storage/innobase/handler/ha_innodb.cc @@ -19739,27 +19739,65 @@ wsrep_abort_transaction( { DBUG_ENTER("wsrep_abort_transaction");
+ ut_ad(bf_thd); + ut_ad(victim_thd); trx_t* victim_trx= thd_to_trx(victim_thd); - trx_t* bf_trx = (bf_thd) ? thd_to_trx(bf_thd) : NULL; + trx_t* bf_trx= thd_to_trx(bf_thd);
if (victim_trx) { + wsrep_thd_UNLOCK(victim_thd);
what keeps victim_trx from disappearing here?
lock_mutex_enter(); trx_mutex_enter(victim_trx); wsrep_innobase_kill_one_trx(bf_thd, bf_trx, victim_trx, signal); lock_mutex_exit(); trx_mutex_exit(victim_trx); wsrep_srv_conc_cancel_wait(victim_trx); + wsrep_thd_LOCK(victim_thd); DBUG_VOID_RETURN; } else { - wsrep_thd_LOCK(victim_thd); wsrep_thd_set_conflict_state(victim_thd, MUST_ABORT); - wsrep_thd_UNLOCK(victim_thd); wsrep_thd_awake(victim_thd, signal); }
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hi Sergei, On Fri, Oct 1, 2021 at 9:05 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Seppo, Jan!
Note, this is 10.2 patch below.
MDEV-25114 Crash: WSREP: invalid state ROLLED_BACK (FATAL)
I think this should say
MDEV-23328 Server hang due to Galera lock conflict resolution
Sure.
+ WSREP_TO_ISOLATION_BEGIN(WSREP_MYSQL_DB, NULL, NULL)
when can it fail and jump to wsrep_error_label? under what conditions?
At least when the node disconnects at a suitable moment.
+/* This is wrapper for wsrep_break_lock in thr_lock.c */ +static int wsrep_thr_abort_thd(void *bf_thd_ptr, void *victim_thd_ptr, my_bool signal) +{ + THD* victim_thd= (THD *) victim_thd_ptr; + /* We need to lock THD::LOCK_thd_data to protect victim + from concurrent usage or disconnect or delete. */
How do you know victim_thd wasn't deleted before you locked LOCK_thd_data below?
I must say the thr_lock code is not familiar to me but there are mysql_mutex_lock() calls to lock->mutex. After code review it is not clear to me what that mutex is.
if (victim_trx) { + wsrep_thd_UNLOCK(victim_thd);
what keeps victim_trx from disappearing here?
Nothing. Do you have suggestions ? R: Jan
Hi, Jan! On Oct 04, Jan Lindström wrote:
Hi Sergei,
+/* This is wrapper for wsrep_break_lock in thr_lock.c */ +static int wsrep_thr_abort_thd(void *bf_thd_ptr, void *victim_thd_ptr, my_bool signal) +{ + THD* victim_thd= (THD *) victim_thd_ptr; + /* We need to lock THD::LOCK_thd_data to protect victim + from concurrent usage or disconnect or delete. */
How do you know victim_thd wasn't deleted before you locked LOCK_thd_data below?
I must say the thr_lock code is not familiar to me but there are mysql_mutex_lock() calls to lock->mutex. After code review it is not clear to me what that mutex is.
where are mysql_mutex_lock() calls to lock->mutex?
if (victim_trx) { + wsrep_thd_UNLOCK(victim_thd);
what keeps victim_trx from disappearing here?
Nothing. Do you have suggestions ?
A couple of thoughts: * Why do you unlock LOCK_thd_data here at all? I believed the whole point of using TOI was to make sure that even if you lock mutexes in a different order, it will not cause a deadlock. * You cannot pass a THD safely over a gap where no locks keep it in existence. You can pass an integer, thread id, and use find_thread_by_id later. By the way, how long can WSREP_TO_ISOLATION_BEGIN() take? Does it need to wait for everything else being replicated? Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hi Sergei, Answers below:
+/* This is wrapper for wsrep_break_lock in thr_lock.c */ +static int wsrep_thr_abort_thd(void *bf_thd_ptr, void *victim_thd_ptr, my_bool signal) +{ + THD* victim_thd= (THD *) victim_thd_ptr; + /* We need to lock THD::LOCK_thd_data to protect victim + from concurrent usage or disconnect or delete. */
How do you know victim_thd wasn't deleted before you locked LOCK_thd_data below?
I must say the thr_lock code is not familiar to me but there are mysql_mutex_lock() calls to lock->mutex. After code review it is not clear to me what that mutex is.
where are mysql_mutex_lock() calls to lock->mutex?
mysys/thr_lock.c there is function thr_lock() there is call to mysql_mutex_lock(&lock->mutex); this is before wsrep_break_lock where we call wsrep_thd_abort
if (victim_trx) { + wsrep_thd_UNLOCK(victim_thd);
what keeps victim_trx from disappearing here?
Nothing. Do you have suggestions ?
A couple of thoughts:
* Why do you unlock LOCK_thd_data here at all? I believed the whole point of using TOI was to make sure that even if you lock mutexes in a different order, it will not cause a deadlock.
This is DDL-case when we have a MDL-conflict not user KILL, we need to release it in my opinion because we need to take mutexes in lock_sys -> trx -> THD::LOCK_thd_data order, if I do not release I can see easily safe_mutex warnings
* You cannot pass a THD safely over a gap where no locks keep it in existence. You can pass an integer, thread id, and use find_thread_by_id later.
I have a suggestion for this: if (victim_trx) { + + const trx_id_t victim_trx_id= victim_trx->id; + const longlong victim_thread= thd_get_thread_id(victim_thd); lock_mutex_enter(); - trx_mutex_enter(victim_trx); - wsrep_innobase_kill_one_trx(bf_thd, bf_trx, victim_trx, signal); + if (trx_t* victim= trx_rw_is_active(victim_trx_id, NULL, true)) // If this succeeds, trx can not go away + { + trx_mutex_enter(victim); + ut_ad(victim->mysql_thd ? + thd_get_thread_id(victim->mysql_thd) == victim_thread : + 1); + wsrep_innobase_kill_one_trx(bf_thd, bf_trx, victim, signal); + trx_mutex_exit(victim); + victim->release_reference(); // Now trx can go away after we have released lock_sys + wsrep_srv_conc_cancel_wait(victim); + } lock_mutex_exit(); - trx_mutex_exit(victim_trx); - wsrep_srv_conc_cancel_wait(victim_trx); DBUG_VOID_RETURN;
By the way, how long can WSREP_TO_ISOLATION_BEGIN() take? Does it need to wait for everything else being replicated?
As long as it takes to start it on all nodes in the cluster. R: Jan
Hi, Jan! On Oct 06, Jan Lindström wrote:
+/* This is wrapper for wsrep_break_lock in thr_lock.c */ +static int wsrep_thr_abort_thd(void *bf_thd_ptr, void *victim_thd_ptr, my_bool signal) +{ + THD* victim_thd= (THD *) victim_thd_ptr; + /* We need to lock THD::LOCK_thd_data to protect victim + from concurrent usage or disconnect or delete. */
How do you know victim_thd wasn't deleted before you locked LOCK_thd_data below?
I must say the thr_lock code is not familiar to me but there are mysql_mutex_lock() calls to lock->mutex. After code review it is not clear to me what that mutex is.
where are mysql_mutex_lock() calls to lock->mutex?
mysys/thr_lock.c there is function thr_lock() there is call to mysql_mutex_lock(&lock->mutex); this is before wsrep_break_lock where we call wsrep_thd_abort
this is for table locks. `lock` is `data->lock` where `data` is THR_LOCK structure somewhere in the table share. It locks tables and handlers, not threads. And InnoDB isn't using it at all anyway.
if (victim_trx) { + wsrep_thd_UNLOCK(victim_thd);
what keeps victim_trx from disappearing here?
Nothing. Do you have suggestions ?
A couple of thoughts:
* Why do you unlock LOCK_thd_data here at all? I believed the whole point of using TOI was to make sure that even if you lock mutexes in a different order, it will not cause a deadlock.
This is DDL-case when we have a MDL-conflict not user KILL, we need to release it in my opinion because we need to take mutexes in lock_sys -> trx -> THD::LOCK_thd_data order, if I do not release I can see easily safe_mutex warnings
I don't understand. First, lock_sys and trx mutexes are not covered by safe_mutex, so it cannot possibly warn about them. Second, the reason for taking mutexes always in the same order is to avoid deadlocks. And yor TOI trick should archieve that.
* You cannot pass a THD safely over a gap where no locks keep it in existence. You can pass an integer, thread id, and use find_thread_by_id later.
I have a suggestion for this:
if (victim_trx) { + + const trx_id_t victim_trx_id= victim_trx->id; + const longlong victim_thread= thd_get_thread_id(victim_thd); lock_mutex_enter(); - trx_mutex_enter(victim_trx); - wsrep_innobase_kill_one_trx(bf_thd, bf_trx, victim_trx, signal); + if (trx_t* victim= trx_rw_is_active(victim_trx_id, NULL, true)) // If this succeeds, trx can not go away + { + trx_mutex_enter(victim); + ut_ad(victim->mysql_thd ? + thd_get_thread_id(victim->mysql_thd) == victim_thread : + 1); + wsrep_innobase_kill_one_trx(bf_thd, bf_trx, victim, signal); + trx_mutex_exit(victim); + victim->release_reference(); // Now trx can go away after we have released lock_sys + wsrep_srv_conc_cancel_wait(victim); + } lock_mutex_exit(); - trx_mutex_exit(victim_trx); - wsrep_srv_conc_cancel_wait(victim_trx); DBUG_VOID_RETURN;
I don't understand, sorry. It's not quite clear what this patch should apply to. May be you can paste what the new code should look like, not a patch? In particular, where are wsrep_thd_UNLOCK and wsrep_thd_LOCK. Are they present in this new variant at all?
By the way, how long can WSREP_TO_ISOLATION_BEGIN() take? Does it need to wait for everything else being replicated?
As long as it takes to start it on all nodes in the cluster.
So, KILL basically won't work when a cluster is starting. Can bf aborts happen during a cluster startup? Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hi Sergei, Answers to your questions below: On Wed, Oct 6, 2021 at 5:03 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Jan!
On Oct 06, Jan Lindström wrote:
+/* This is wrapper for wsrep_break_lock in thr_lock.c */ +static int wsrep_thr_abort_thd(void *bf_thd_ptr, void
*victim_thd_ptr, my_bool signal)
+{ + THD* victim_thd= (THD *) victim_thd_ptr; + /* We need to lock THD::LOCK_thd_data to protect victim + from concurrent usage or disconnect or delete. */
How do you know victim_thd wasn't deleted before you locked LOCK_thd_data below?
I must say the thr_lock code is not familiar to me but there are mysql_mutex_lock() calls to lock->mutex. After code review it is not clear to me what that mutex is.
where are mysql_mutex_lock() calls to lock->mutex?
mysys/thr_lock.c there is function thr_lock() there is call to mysql_mutex_lock(&lock->mutex); this is before wsrep_break_lock where we call wsrep_thd_abort
this is for table locks. `lock` is `data->lock` where `data` is THR_LOCK structure somewhere in the table share. It locks tables and handlers, not threads. And InnoDB isn't using it at all anyway.
I do not know how to change this one so that thd is protected, can I even do so as this code does not know THD internals...
if (victim_trx) { + wsrep_thd_UNLOCK(victim_thd);
what keeps victim_trx from disappearing here?
Nothing. Do you have suggestions ?
A couple of thoughts:
* Why do you unlock LOCK_thd_data here at all? I believed the whole point of using TOI was to make sure that even if you lock mutexes in a different order, it will not cause a deadlock.
This is DDL-case when we have a MDL-conflict not user KILL, we need to release it in my opinion because we need to take mutexes in lock_sys -> trx -> THD::LOCK_thd_data order, if I do not release I can see easily safe_mutex warnings
I don't understand. First, lock_sys and trx mutexes are not covered by safe_mutex, so it cannot possibly warn about them.
Second, the reason for taking mutexes always in the same order is to avoid deadlocks. And yor TOI trick should archieve that.
static void wsrep_abort_transaction( /*====================*/ handlerton* hton, THD *bf_thd, THD *victim_thd, my_bool signal) { DBUG_ENTER("wsrep_abort_transaction"); trx_t* victim_trx = thd_to_trx(victim_thd); trx_t* bf_trx = (bf_thd) ? thd_to_trx(bf_thd) : NULL; WSREP_DEBUG("abort transaction: BF: %s victim: %s victim conf: %d", wsrep_thd_query(bf_thd), wsrep_thd_query(victim_thd), wsrep_thd_conflict_state(victim_thd, FALSE)); if (victim_trx) { const trx_id_t victim_trx_id= victim_trx->id; const longlong victim_thread= thd_get_thread_id(victim_thd); WSREP_DEBUG("wsrep_abort_transaction: Victim thread %ld " "transaction " TRX_ID_FMT " trx_state %d", thd_get_thread_id(victim_thd), victim_trx->id, victim_trx->state); /* This is necessary as correct mutexing order is lock_sys -> trx -> THD::LOCK_thd_data and below function assumes we have lock_sys and trx locked and takes THD::LOCK_thd_data for THD state check. */ wsrep_thd_UNLOCK(victim_thd); // GAP where thd or trx is not protected DEBUG_SYNC(bf_thd, "wsrep_abort_victim_unlocked"); DBUG_EXECUTE_IF("wsrep_abort_replicated_sleep", WSREP_DEBUG("wsrep_abort_transaction: sleeping " "for thread %ld ", thd_get_thread_id(victim_thd)); lock_mutex_enter(); if (trx_t* victim= trx_rw_is_active(victim_trx_id, NULL, true)) { // As trx is now referenced it can't go away trx_mutex_enter(victim); ut_ad(victim->mysql_thd ? thd_get_thread_id(victim->mysql_thd) == victim_thread :1); // In below we take THD::LOCK_thd_data wsrep_innobase_kill_one_trx(bf_thd, bf_trx, victim, signal); trx_mutex_exit(victim); victim->release_reference(); wsrep_srv_conc_cancel_wait(victim); } lock_mutex_exit(); DBUG_VOID_RETURN; } else { WSREP_DEBUG("victim does not have transaction"); wsrep_thd_LOCK(victim_thd); wsrep_thd_set_conflict_state(victim_thd, MUST_ABORT); wsrep_thd_awake(victim_thd, signal); wsrep_thd_UNLOCK(victim_thd); } DBUG_VOID_RETURN; }
By the way, how long can WSREP_TO_ISOLATION_BEGIN() take? Does it need to wait for everything else being replicated?
As long as it takes to start it on all nodes in the cluster.
So, KILL basically won't work when a cluster is starting. Can bf aborts happen during a cluster startup?
Not sure. R: Jan
Hi, Jan! On Oct 06, Jan Lindström wrote:
I must say the thr_lock code is not familiar to me but there are mysql_mutex_lock() calls to lock->mutex. After code review it is not clear to me what that mutex is.
this is for table locks. `lock` is `data->lock` where `data` is THR_LOCK structure somewhere in the table share. It locks tables and handlers, not threads. And InnoDB isn't using it at all anyway.
I do not know how to change this one so that thd is protected, can I even do so as this code does not know THD internals...
No, that was my point. thr_lock is for *table locks*, you cannot use it to protect a THD, they protect TABLE and TABLE_SHARE. Basically, it's irrelevant for this MDEV.
> if (victim_trx) { > + wsrep_thd_UNLOCK(victim_thd);
what keeps victim_trx from disappearing here?
Nothing. Do you have suggestions ?
A couple of thoughts:
* Why do you unlock LOCK_thd_data here at all? I believed the whole point of using TOI was to make sure that even if you lock mutexes in a different order, it will not cause a deadlock.
This is DDL-case when we have a MDL-conflict not user KILL, we need to release it in my opinion because we need to take mutexes in lock_sys -> trx -> THD::LOCK_thd_data order, if I do not release I can see easily safe_mutex warnings
I don't understand. First, lock_sys and trx mutexes are not covered by safe_mutex, so it cannot possibly warn about them.
Second, the reason for taking mutexes always in the same order is to avoid deadlocks. And yor TOI trick should archieve that.
static void wsrep_abort_transaction( handlerton* hton, THD *bf_thd, THD *victim_thd, my_bool signal) { trx_t* victim_trx = thd_to_trx(victim_thd); trx_t* bf_trx = (bf_thd) ? thd_to_trx(bf_thd) : NULL;
if (victim_trx) { const trx_id_t victim_trx_id= victim_trx->id; const longlong victim_thread= thd_get_thread_id(victim_thd); /* This is necessary as correct mutexing order is lock_sys -> trx -> THD::LOCK_thd_data and below function assumes we have lock_sys and trx locked and takes THD::LOCK_thd_data for THD state check. */ wsrep_thd_UNLOCK(victim_thd); // GAP where thd or trx is not protected lock_mutex_enter(); if (trx_t* victim= trx_rw_is_active(victim_trx_id, NULL, true)) {
trx_rw_is_active needs to be modified to do that, right?
// As trx is now referenced it can't go away
Hmm. What happens if the thd that owns this transaction is killed or the user disconnects? THD gets freed. What happens to the referenced trx?
trx_mutex_enter(victim); // In below we take THD::LOCK_thd_data
"we take victim->mysql_thd->LOCK_thd_data", correct?
wsrep_innobase_kill_one_trx(bf_thd, bf_trx, victim, signal); trx_mutex_exit(victim); victim->release_reference(); wsrep_srv_conc_cancel_wait(victim); } lock_mutex_exit(); DBUG_VOID_RETURN; } else { wsrep_thd_LOCK(victim_thd); wsrep_thd_set_conflict_state(victim_thd, MUST_ABORT); wsrep_thd_awake(victim_thd, signal); wsrep_thd_UNLOCK(victim_thd); } DBUG_VOID_RETURN; }
By the way, how long can WSREP_TO_ISOLATION_BEGIN() take? Does it need to wait for everything else being replicated?
As long as it takes to start it on all nodes in the cluster.
So, KILL basically won't work when a cluster is starting. Can bf aborts happen during a cluster startup?
Not sure.
What I mean it, what if KILL would ignore WSREP_TO_ISOLATION_BEGIN failure and will just proceed killing? Perhaps if WSREP_TO_ISOLATION_BEGIN fails it means that there can be no bf aborts anyway? Could you try to find it out? Regards, Sergei
Hi Sergei,
if (victim_trx) { const trx_id_t victim_trx_id= victim_trx->id; const longlong victim_thread= thd_get_thread_id(victim_thd); /* This is necessary as correct mutexing order is lock_sys -> trx -> THD::LOCK_thd_data and below function assumes we have lock_sys and trx locked and takes THD::LOCK_thd_data for THD state check. */ wsrep_thd_UNLOCK(victim_thd); // GAP where thd or trx is not protected lock_mutex_enter(); if (trx_t* victim= trx_rw_is_active(victim_trx_id, NULL, true)) {
trx_rw_is_active needs to be modified to do that, right?
No this is current behaviour, I did not change anything on trx_rw_is_active
// As trx is now referenced it can't go away
Hmm. What happens if the thd that owns this transaction is killed or the user disconnects? THD gets freed. What happens to the referenced trx?
In my understanding you can't just free THD before it is aborted or committed, right ? As we have lock_sys, no trx can commit or abort inside InnoDB, and after this function this trx can't be deleted.
trx_mutex_enter(victim); // In below we take THD::LOCK_thd_data
"we take victim->mysql_thd->LOCK_thd_data", correct?
Yes
What I mean it, what if KILL would ignore WSREP_TO_ISOLATION_BEGIN failure and will just proceed killing? Perhaps if WSREP_TO_ISOLATION_BEGIN fails it means that there can be no bf aborts anyway? Could you try to find it out?
User KILL can happen only after the node has moded to READY state so at startup you can't use it before the cluster is ready to serve. We could just ignore the TOI error here, but what is the point? There are bigger problems in the cluster if TOI fails. TOI can fail only in this node as all other nodes in the cluster will ignore the KILL command (after parsing it). R: Jan
Hi, Jan! On Oct 10, Jan Lindström wrote:
Hi Sergei,
if (victim_trx) { const trx_id_t victim_trx_id= victim_trx->id; const longlong victim_thread= thd_get_thread_id(victim_thd); /* This is necessary as correct mutexing order is lock_sys -> trx -> THD::LOCK_thd_data and below function assumes we have lock_sys and trx locked and takes THD::LOCK_thd_data for THD state check. */ wsrep_thd_UNLOCK(victim_thd); // GAP where thd or trx is not protected lock_mutex_enter(); if (trx_t* victim= trx_rw_is_active(victim_trx_id, NULL, true)) {
trx_rw_is_active needs to be modified to do that, right?
No this is current behaviour, I did not change anything on trx_rw_is_active
In xtradb trx_rw_is_active returns bool. I think xtradb is still the default innodb in 10.2. In innobase it returns, indeed, trx_t*, I didn't notice that at first, that's why I was confused.
// As trx is now referenced it can't go away
Hmm. What happens if the thd that owns this transaction is killed or the user disconnects? THD gets freed. What happens to the referenced trx?
In my understanding you can't just free THD before it is aborted or committed, right ? As we have lock_sys, no trx can commit or abort inside InnoDB, and after this function this trx can't be deleted.
okay, good point.
What I mean it, what if KILL would ignore WSREP_TO_ISOLATION_BEGIN failure and will just proceed killing? Perhaps if WSREP_TO_ISOLATION_BEGIN fails it means that there can be no bf aborts anyway? Could you try to find it out?
User KILL can happen only after the node has moded to READY state so at startup you can't use it before the cluster is ready to serve. We could just ignore the TOI error here, but what is the point? There are bigger problems in the cluster if TOI fails. TOI can fail only in this node as all other nodes in the cluster will ignore the KILL command (after parsing it).
Okay then Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hi Sergei,
trx_rw_is_active needs to be modified to do that, right?
No this is current behaviour, I did not change anything on trx_rw_is_active
In xtradb trx_rw_is_active returns bool. I think xtradb is still the default innodb in 10.2.
In innobase it returns, indeed, trx_t*, I didn't notice that at first, that's why I was confused.
xtradb is not anymore used at all, not even 10.2. R: Jan
Hi Sergei, Update on what happens after TOI failure.
What I mean it, what if KILL would ignore WSREP_TO_ISOLATION_BEGIN failure and will just proceed killing? Perhaps if WSREP_TO_ISOLATION_BEGIN fails it means that there can be no bf aborts anyway? Could you try to find it out?
After TOI error that node will not serve user anymore, you either get ERROR 40001: WSREP replication failed. Check your wsrep connection state and retry the query. or ERROR 08S01: WSREP has not yet prepared node for application use User KILL will return the first one. bf_kill can't happen as you can't issue commands to cause it. R: Jan
Update on disconnect
// As trx is now referenced it can't go away
Hmm. What happens if the thd that owns this transaction is killed or the user disconnects? THD gets freed. What happens to the referenced trx?
I created new mtr-tests (galera_disconnect_debug) to try disconnecting victim connections on several debug sync points on this code this place included and could not reproduce any crash. Thus, transaction can't be killed while we are here or disconnect will not free THD. I created a new mtr-test (galera_to_error) also for the TOI error case and verified that after that you can't issue statements anymore including KILL QUERY. R: Jan
Hi, Jan! Great, thanks! On Oct 11, Jan Lindström wrote:
Update on disconnect
// As trx is now referenced it can't go away
Hmm. What happens if the thd that owns this transaction is killed or the user disconnects? THD gets freed. What happens to the referenced trx?
I created new mtr-tests (galera_disconnect_debug) to try disconnecting victim connections on several debug sync points on this code this place included and could not reproduce any crash. Thus, transaction can't be killed while we are here or disconnect will not free THD.
I created a new mtr-test (galera_to_error) also for the TOI error case and verified that after that you can't issue statements anymore including KILL QUERY.
R: Jan Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hi Sergei, After QA runs done by Ramesh, we now know the latest fix candidate i.e. what is in bb-10.2-MDEV-25114-galera-v2 is incorrect. Problem is in wsrep_close_connections() as it holds LOCK_thread_count while it does abort_replicated that will call wsrep_abort_transaction and there we use find_thread_by_id that would also take LOCK_thread_count. As there is another code path here, the problem is not easily fixed. We can't just release LOCK_thread_count at wsrep_close_connections as we iterate the thread list. I must say I'm not sure what to do now. (gdb) bt #0 __pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56 #1 0x000056333963a2e8 in my_write_core (sig=sig@entry=6) at /test/mtest/10.2_dbg/mysys/stacktrace.c:382 #2 0x0000563338f2993d in handle_fatal_signal (sig=6) at /test/mtest/10.2_dbg/sql/signal_handler.cc:355 #3 <signal handler called> #4 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #5 0x000014b799572859 in __GI_abort () at abort.c:79 #6 0x000056333963edc4 in safe_mutex_lock (mp=0x563339e47220 <LOCK_thread_count>, my_flags=my_flags@entry=0, file=file@entry=0x5633396d9050 "/test/mtest/10.2_dbg/sql/sql_parse.cc", line=line@entry=8902) at /test/mtest/10.2_dbg/mysys/thr_mutex.c:264 #7 0x0000563338d1aea7 in inline_mysql_mutex_lock (src_line=8902, src_file=0x5633396d9050 "/test/mtest/10.2_dbg/sql/sql_parse.cc", that=<optimized out>) at /test/mtest/10.2_dbg/include/mysql/psi/mysql_thread.h:688 #8 find_thread_by_id (id=id@entry=48, query_id=query_id@entry=false) at /test/mtest/10.2_dbg/sql/sql_parse.cc:8902 #9 0x0000563339134c39 in wsrep_abort_transaction (hton=<optimized out>, bf_thd=0x14b680000d90, victim_thd=<optimized out>, signal=<optimized out>) at /test/mtest/10.2_dbg/storage/innobase/handler/ha_innodb.cc:19821 #10 0x0000563338f38cbf in ha_abort_transaction (bf_thd=bf_thd@entry=0x14b680000d90, victim_thd=victim_thd@entry=0x14b680000d90, signal=signal@entry=1 '\001') at /test/mtest/10.2_dbg/sql/handler.cc:6327 #11 0x0000563338ebed6d in wsrep_abort_thd (bf_thd_ptr=bf_thd_ptr@entry=0x14b680000d90, victim_thd_ptr=victim_thd_ptr@entry=0x14b680000d90, signal=signal@entry=1 '\001') at /test/mtest/10.2_dbg/sql/wsrep_thd.cc:832 #12 0x0000563338eaa2bd in abort_replicated (thd=thd@entry=0x14b680000d90) at /test/mtest/10.2_dbg/sql/wsrep_mysqld.cc:2269 #13 0x0000563338eae097 in wsrep_close_client_connections (wait_to_end=wait_to_end@entry=1 '\001', except_caller_thd=except_caller_thd@entry=0x0) at /test/mtest/10.2_dbg/sql/wsrep_mysqld.cc:2437 #14 0x0000563338eaedf6 in wsrep_stop_replication (thd=thd@entry=0x0) at /test/mtest/10.2_dbg/sql/wsrep_mysqld.cc:962 #15 0x0000563338c543d8 in kill_server (sig_ptr=sig_ptr@entry=0x0) at /test/mtest/10.2_dbg/sql/mysqld.cc:2009 #16 0x0000563338c558d5 in kill_server_thread (arg=<optimized out>) at /test/mtest/10.2_dbg/sql/mysqld.cc:2047 #17 0x000014b799a7a609 in start_thread (arg=<optimized out>) at pthread_create.c:477 #18 0x000014b79966f293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 R: Jan On Wed, Oct 6, 2021 at 5:03 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Jan!
On Oct 06, Jan Lindström wrote:
+/* This is wrapper for wsrep_break_lock in thr_lock.c */ +static int wsrep_thr_abort_thd(void *bf_thd_ptr, void
*victim_thd_ptr, my_bool signal)
+{ + THD* victim_thd= (THD *) victim_thd_ptr; + /* We need to lock THD::LOCK_thd_data to protect victim + from concurrent usage or disconnect or delete. */
How do you know victim_thd wasn't deleted before you locked LOCK_thd_data below?
I must say the thr_lock code is not familiar to me but there are mysql_mutex_lock() calls to lock->mutex. After code review it is not clear to me what that mutex is.
where are mysql_mutex_lock() calls to lock->mutex?
mysys/thr_lock.c there is function thr_lock() there is call to mysql_mutex_lock(&lock->mutex); this is before wsrep_break_lock where we call wsrep_thd_abort
this is for table locks. `lock` is `data->lock` where `data` is THR_LOCK structure somewhere in the table share. It locks tables and handlers, not threads. And InnoDB isn't using it at all anyway.
if (victim_trx) { + wsrep_thd_UNLOCK(victim_thd);
what keeps victim_trx from disappearing here?
Nothing. Do you have suggestions ?
A couple of thoughts:
* Why do you unlock LOCK_thd_data here at all? I believed the whole point of using TOI was to make sure that even if you lock mutexes in a different order, it will not cause a deadlock.
This is DDL-case when we have a MDL-conflict not user KILL, we need to release it in my opinion because we need to take mutexes in lock_sys -> trx -> THD::LOCK_thd_data order, if I do not release I can see easily safe_mutex warnings
I don't understand. First, lock_sys and trx mutexes are not covered by safe_mutex, so it cannot possibly warn about them.
Second, the reason for taking mutexes always in the same order is to avoid deadlocks. And yor TOI trick should archieve that.
* You cannot pass a THD safely over a gap where no locks keep it in existence. You can pass an integer, thread id, and use find_thread_by_id later.
I have a suggestion for this:
if (victim_trx) { + + const trx_id_t victim_trx_id= victim_trx->id; + const longlong victim_thread= thd_get_thread_id(victim_thd); lock_mutex_enter(); - trx_mutex_enter(victim_trx); - wsrep_innobase_kill_one_trx(bf_thd, bf_trx, victim_trx, signal); + if (trx_t* victim= trx_rw_is_active(victim_trx_id, NULL, true)) // If this succeeds, trx can not go away + { + trx_mutex_enter(victim); + ut_ad(victim->mysql_thd ? + thd_get_thread_id(victim->mysql_thd) == victim_thread : + 1); + wsrep_innobase_kill_one_trx(bf_thd, bf_trx, victim, signal); + trx_mutex_exit(victim); + victim->release_reference(); // Now trx can go away after we have released lock_sys + wsrep_srv_conc_cancel_wait(victim); + } lock_mutex_exit(); - trx_mutex_exit(victim_trx); - wsrep_srv_conc_cancel_wait(victim_trx); DBUG_VOID_RETURN;
I don't understand, sorry. It's not quite clear what this patch should apply to. May be you can paste what the new code should look like, not a patch?
In particular, where are wsrep_thd_UNLOCK and wsrep_thd_LOCK. Are they present in this new variant at all?
By the way, how long can WSREP_TO_ISOLATION_BEGIN() take? Does it need to wait for everything else being replicated?
As long as it takes to start it on all nodes in the cluster.
So, KILL basically won't work when a cluster is starting. Can bf aborts happen during a cluster startup?
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hi Sergei, Update on wsrep_close_connections problem. My suggestion to fix this issue is on https://github.com/MariaDB/server/commit/99cbe03a44cc95e6f548550df51e7201ebe... If you have a better solution, please advise. R: Jan On Mon, Oct 11, 2021 at 12:52 PM Jan Lindström <jan.lindstrom@mariadb.com> wrote:
Hi Sergei,
After QA runs done by Ramesh, we now know the latest fix candidate i.e. what is in bb-10.2-MDEV-25114-galera-v2 is incorrect. Problem is in wsrep_close_connections() as it holds LOCK_thread_count while it does abort_replicated that will call wsrep_abort_transaction and there we use find_thread_by_id that would also take LOCK_thread_count. As there is another code path here, the problem is not easily fixed. We can't just release LOCK_thread_count at wsrep_close_connections as we iterate the thread list.
I must say I'm not sure what to do now.
(gdb) bt #0 __pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56 #1 0x000056333963a2e8 in my_write_core (sig=sig@entry=6) at /test/mtest/10.2_dbg/mysys/stacktrace.c:382 #2 0x0000563338f2993d in handle_fatal_signal (sig=6) at /test/mtest/10.2_dbg/sql/signal_handler.cc:355 #3 <signal handler called> #4 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #5 0x000014b799572859 in __GI_abort () at abort.c:79 #6 0x000056333963edc4 in safe_mutex_lock (mp=0x563339e47220 <LOCK_thread_count>, my_flags=my_flags@entry=0, file=file@entry=0x5633396d9050 "/test/mtest/10.2_dbg/sql/sql_parse.cc", line=line@entry=8902) at /test/mtest/10.2_dbg/mysys/thr_mutex.c:264 #7 0x0000563338d1aea7 in inline_mysql_mutex_lock (src_line=8902, src_file=0x5633396d9050 "/test/mtest/10.2_dbg/sql/sql_parse.cc", that=<optimized out>) at /test/mtest/10.2_dbg/include/mysql/psi/mysql_thread.h:688 #8 find_thread_by_id (id=id@entry=48, query_id=query_id@entry=false) at /test/mtest/10.2_dbg/sql/sql_parse.cc:8902 #9 0x0000563339134c39 in wsrep_abort_transaction (hton=<optimized out>, bf_thd=0x14b680000d90, victim_thd=<optimized out>, signal=<optimized out>) at /test/mtest/10.2_dbg/storage/innobase/handler/ha_innodb.cc:19821 #10 0x0000563338f38cbf in ha_abort_transaction (bf_thd=bf_thd@entry=0x14b680000d90, victim_thd=victim_thd@entry=0x14b680000d90, signal=signal@entry=1 '\001') at /test/mtest/10.2_dbg/sql/handler.cc:6327 #11 0x0000563338ebed6d in wsrep_abort_thd (bf_thd_ptr=bf_thd_ptr@entry=0x14b680000d90, victim_thd_ptr=victim_thd_ptr@entry=0x14b680000d90, signal=signal@entry=1 '\001') at /test/mtest/10.2_dbg/sql/wsrep_thd.cc:832 #12 0x0000563338eaa2bd in abort_replicated (thd=thd@entry=0x14b680000d90) at /test/mtest/10.2_dbg/sql/wsrep_mysqld.cc:2269 #13 0x0000563338eae097 in wsrep_close_client_connections (wait_to_end=wait_to_end@entry=1 '\001', except_caller_thd=except_caller_thd@entry=0x0) at /test/mtest/10.2_dbg/sql/wsrep_mysqld.cc:2437 #14 0x0000563338eaedf6 in wsrep_stop_replication (thd=thd@entry=0x0) at /test/mtest/10.2_dbg/sql/wsrep_mysqld.cc:962 #15 0x0000563338c543d8 in kill_server (sig_ptr=sig_ptr@entry=0x0) at /test/mtest/10.2_dbg/sql/mysqld.cc:2009 #16 0x0000563338c558d5 in kill_server_thread (arg=<optimized out>) at /test/mtest/10.2_dbg/sql/mysqld.cc:2047 #17 0x000014b799a7a609 in start_thread (arg=<optimized out>) at pthread_create.c:477 #18 0x000014b79966f293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
R: Jan
On Wed, Oct 6, 2021 at 5:03 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Jan!
On Oct 06, Jan Lindström wrote:
> +/* This is wrapper for wsrep_break_lock in thr_lock.c */ > +static int wsrep_thr_abort_thd(void *bf_thd_ptr, void
*victim_thd_ptr, my_bool signal)
> +{ > + THD* victim_thd= (THD *) victim_thd_ptr; > + /* We need to lock THD::LOCK_thd_data to protect victim > + from concurrent usage or disconnect or delete. */
How do you know victim_thd wasn't deleted before you locked LOCK_thd_data below?
I must say the thr_lock code is not familiar to me but there are mysql_mutex_lock() calls to lock->mutex. After code review it is not clear to me what that mutex is.
where are mysql_mutex_lock() calls to lock->mutex?
mysys/thr_lock.c there is function thr_lock() there is call to mysql_mutex_lock(&lock->mutex); this is before wsrep_break_lock where we call wsrep_thd_abort
this is for table locks. `lock` is `data->lock` where `data` is THR_LOCK structure somewhere in the table share. It locks tables and handlers, not threads. And InnoDB isn't using it at all anyway.
> if (victim_trx) { > + wsrep_thd_UNLOCK(victim_thd);
what keeps victim_trx from disappearing here?
Nothing. Do you have suggestions ?
A couple of thoughts:
* Why do you unlock LOCK_thd_data here at all? I believed the whole point of using TOI was to make sure that even if you lock mutexes in a different order, it will not cause a deadlock.
This is DDL-case when we have a MDL-conflict not user KILL, we need to release it in my opinion because we need to take mutexes in lock_sys -> trx -> THD::LOCK_thd_data order, if I do not release I can see easily safe_mutex warnings
I don't understand. First, lock_sys and trx mutexes are not covered by safe_mutex, so it cannot possibly warn about them.
Second, the reason for taking mutexes always in the same order is to avoid deadlocks. And yor TOI trick should archieve that.
* You cannot pass a THD safely over a gap where no locks keep it in existence. You can pass an integer, thread id, and use find_thread_by_id later.
I have a suggestion for this:
if (victim_trx) { + + const trx_id_t victim_trx_id= victim_trx->id; + const longlong victim_thread= thd_get_thread_id(victim_thd); lock_mutex_enter(); - trx_mutex_enter(victim_trx); - wsrep_innobase_kill_one_trx(bf_thd, bf_trx, victim_trx, signal); + if (trx_t* victim= trx_rw_is_active(victim_trx_id, NULL, true)) // If this succeeds, trx can not go away + { + trx_mutex_enter(victim); + ut_ad(victim->mysql_thd ? + thd_get_thread_id(victim->mysql_thd) == victim_thread : + 1); + wsrep_innobase_kill_one_trx(bf_thd, bf_trx, victim, signal); + trx_mutex_exit(victim); + victim->release_reference(); // Now trx can go away after we have released lock_sys + wsrep_srv_conc_cancel_wait(victim); + } lock_mutex_exit(); - trx_mutex_exit(victim_trx); - wsrep_srv_conc_cancel_wait(victim_trx); DBUG_VOID_RETURN;
I don't understand, sorry. It's not quite clear what this patch should apply to. May be you can paste what the new code should look like, not a patch?
In particular, where are wsrep_thd_UNLOCK and wsrep_thd_LOCK. Are they present in this new variant at all?
By the way, how long can WSREP_TO_ISOLATION_BEGIN() take? Does it need to wait for everything else being replicated?
As long as it takes to start it on all nodes in the cluster.
So, KILL basically won't work when a cluster is starting. Can bf aborts happen during a cluster startup?
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hi, Jan! Here's an idea of the fix: Let's always use the KILL mutex locking order, that is victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex For this we need to fix wsrep_abort_transaction(), which is called from the server, and wsrep_innobase_kill_one_trx(), which is called from BF thread. wsrep_abort_transaction() can be fixed by not invoking wsrep_innobase_kill_one_trx() and always using KILL code path (that is wsrep_thd_awake) and forcing rollback after the kill. wsrep_innobase_kill_one_trx() can be fixed by not locking LOCK_thd_data at all, just don't lock it. We know that the victim waits on a lock inside InnoDB and we've locked trx mutex and lock_sys mutex. The victim cannot go away, cannot modify its data, it cannot do anything. So, LOCK_thd_data doesn't seem to be necessary at that point. I've attached a demo patch. It compiles, but I didn't try to run it, it's only to show the idea, not a working fix (I already suspect I removed too much from wsrep_abort_transaction()). Note it's the patch for 10.2 at the commit 29bbcac0ee8^ - that is one commit before my fix. On Oct 12, Jan Lindström wrote:
Hi Sergei,
Update on wsrep_close_connections problem. My suggestion to fix this issue is on https://github.com/MariaDB/server/commit/99cbe03a44cc95e6f548550df51e7201ebe...
If you have a better solution, please advise.
R: Jan
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hi, Few questions: (1) Is this review for a full patch or just problems on wsrep_abort_transaction ? (2) In case at wsrep_abort_transaction we do not have a transaction idea is that we do not anymore want to enter InnoDB i.e. innobase_kill_query, that is the reason we set MUST_ABORT to wsrep_conflict_state so that no more mutexes from InnoDB is acquired. (3) In wsrep_close_connections code used LOCK_thread_lock, is that enough to protect THDs on list from concurrent disconnect and delete? (I added locking for THD::LOCK_thd_data but I was not sure is it necessary) (4) Do we still keep manual KILL as TOI ? (5) What I do with thr_lock.c there we pass victim THD pointer to wsrep_abort_thd and it is as you pointed out mostly unprotected on that code, LOCK TABLE is TOI (6) Please note that fix will not be ready 19th of October as we again change critical path, Ramesh will run QA with pquery On Thu, Oct 14, 2021 at 9:49 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Jan!
Here's an idea of the fix:
Let's always use the KILL mutex locking order, that is
victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex
For this we need to fix wsrep_abort_transaction(), which is called from the server, and wsrep_innobase_kill_one_trx(), which is called from BF thread.
wsrep_abort_transaction() can be fixed by not invoking wsrep_innobase_kill_one_trx() and always using KILL code path (that is wsrep_thd_awake) and forcing rollback after the kill.
wsrep_innobase_kill_one_trx() can be fixed by not locking LOCK_thd_data at all, just don't lock it. We know that the victim waits on a lock inside InnoDB and we've locked trx mutex and lock_sys mutex. The victim cannot go away, cannot modify its data, it cannot do anything. So, LOCK_thd_data doesn't seem to be necessary at that point.
I've attached a demo patch. It compiles, but I didn't try to run it, it's only to show the idea, not a working fix (I already suspect I removed too much from wsrep_abort_transaction()). Note it's the patch for 10.2 at the commit 29bbcac0ee8^ - that is one commit before my fix.
On Oct 12, Jan Lindström wrote:
Hi Sergei,
Update on wsrep_close_connections problem. My suggestion to fix this issue is on
https://github.com/MariaDB/server/commit/99cbe03a44cc95e6f548550df51e7201ebe...
If you have a better solution, please advise.
R: Jan
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hi, Jan! On Oct 15, Jan Lindström wrote:
Few questions:
(1) Is this review for a full patch or just problems on wsrep_abort_transaction ?
a full patch
(2) In case at wsrep_abort_transaction we do not have a transaction idea is that we do not anymore want to enter InnoDB i.e. innobase_kill_query, that is the reason we set MUST_ABORT to wsrep_conflict_state so that no more mutexes from InnoDB is acquired.
right, sorry. I thought I've removed too much from wsrep_abort_transaction.
(3) In wsrep_close_connections code used LOCK_thread_lock, is that enough to protect THDs on list from concurrent disconnect and delete? (I added locking for THD::LOCK_thd_data but I was not sure is it necessary)
Isn't it a separate issue? Anyway, as far as I understand, LOCK_thread_count protects the list of threads, a THD cannot be added to or deleted from the list. but a THD can be disconnected or killed and it'll wait somewhere at the very end of the destructor to remove itself from the list. So, yes, you need to take LOCK_thd_data or LOCK_thd_kill too.
(4) Do we still keep manual KILL as TOI ?
Not in my patch, no.
(5) What I do with thr_lock.c there we pass victim THD pointer to wsrep_abort_thd and it is as you pointed out mostly unprotected on that code, LOCK TABLE is TOI
It's fine. Same logic as in wsrep_innobase_kill_one_trx(). You kill a victim thd that owns thr locks. This THD cannot just disappear, it has to release those locks first. And you own the lock->mutex. So the THD isn't going anywhere until you return from thr_lock().
(6) Please note that fix will not be ready 19th of October as we again change critical path, Ramesh will run QA with pquery
sure
On Thu, Oct 14, 2021 at 9:49 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Jan!
Here's an idea of the fix:
Let's always use the KILL mutex locking order, that is
victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex
For this we need to fix wsrep_abort_transaction(), which is called from the server, and wsrep_innobase_kill_one_trx(), which is called from BF thread.
wsrep_abort_transaction() can be fixed by not invoking wsrep_innobase_kill_one_trx() and always using KILL code path (that is wsrep_thd_awake) and forcing rollback after the kill.
wsrep_innobase_kill_one_trx() can be fixed by not locking LOCK_thd_data at all, just don't lock it. We know that the victim waits on a lock inside InnoDB and we've locked trx mutex and lock_sys mutex. The victim cannot go away, cannot modify its data, it cannot do anything. So, LOCK_thd_data doesn't seem to be necessary at that point.
I've attached a demo patch. It compiles, but I didn't try to run it, it's only to show the idea, not a working fix (I already suspect I removed too much from wsrep_abort_transaction()). Note it's the patch for 10.2 at the commit 29bbcac0ee8^ - that is one commit before my fix.
On Oct 12, Jan Lindström wrote:
Hi Sergei,
Update on wsrep_close_connections problem. My suggestion to fix this issue is on
https://github.com/MariaDB/server/commit/99cbe03a44cc95e6f548550df51e7201ebe...
If you have a better solution, please advise.
R: Jan
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hi Sergei, I have implemented PlanE as agreed on branch bb-10.2-MDEV-25114-planE-galera and mostly regression testing looks promising. However, I have problems with MDL-locks. For example test case galera.galera_toi_lock_exclusive hangs and I have not yet found out why. I will ask help from Seppo. R: Jan On Fri, Oct 15, 2021 at 11:36 AM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Jan!
On Oct 15, Jan Lindström wrote:
Few questions:
(1) Is this review for a full patch or just problems on wsrep_abort_transaction ?
a full patch
(2) In case at wsrep_abort_transaction we do not have a transaction idea is that we do not anymore want to enter InnoDB i.e. innobase_kill_query, that is the reason we set MUST_ABORT to wsrep_conflict_state so that no more mutexes from InnoDB is acquired.
right, sorry. I thought I've removed too much from wsrep_abort_transaction.
(3) In wsrep_close_connections code used LOCK_thread_lock, is that enough to protect THDs on list from concurrent disconnect and delete? (I added locking for THD::LOCK_thd_data but I was not sure is it necessary)
Isn't it a separate issue?
Anyway, as far as I understand, LOCK_thread_count protects the list of threads, a THD cannot be added to or deleted from the list.
but a THD can be disconnected or killed and it'll wait somewhere at the very end of the destructor to remove itself from the list. So, yes, you need to take LOCK_thd_data or LOCK_thd_kill too.
(4) Do we still keep manual KILL as TOI ?
Not in my patch, no.
(5) What I do with thr_lock.c there we pass victim THD pointer to wsrep_abort_thd and it is as you pointed out mostly unprotected on that code, LOCK TABLE is TOI
It's fine. Same logic as in wsrep_innobase_kill_one_trx(). You kill a victim thd that owns thr locks. This THD cannot just disappear, it has to release those locks first. And you own the lock->mutex. So the THD isn't going anywhere until you return from thr_lock().
(6) Please note that fix will not be ready 19th of October as we again change critical path, Ramesh will run QA with pquery
sure
On Thu, Oct 14, 2021 at 9:49 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Jan!
Here's an idea of the fix:
Let's always use the KILL mutex locking order, that is
victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex
For this we need to fix wsrep_abort_transaction(), which is called from the server, and wsrep_innobase_kill_one_trx(), which is called from BF thread.
wsrep_abort_transaction() can be fixed by not invoking wsrep_innobase_kill_one_trx() and always using KILL code path (that is wsrep_thd_awake) and forcing rollback after the kill.
wsrep_innobase_kill_one_trx() can be fixed by not locking LOCK_thd_data at all, just don't lock it. We know that the victim waits on a lock inside InnoDB and we've locked trx mutex and lock_sys mutex. The victim cannot go away, cannot modify its data, it cannot do anything. So, LOCK_thd_data doesn't seem to be necessary at that point.
I've attached a demo patch. It compiles, but I didn't try to run it, it's only to show the idea, not a working fix (I already suspect I removed too much from wsrep_abort_transaction()). Note it's the patch for 10.2 at the commit 29bbcac0ee8^ - that is one commit before my fix.
On Oct 12, Jan Lindström wrote:
Hi Sergei,
Update on wsrep_close_connections problem. My suggestion to fix this issue is on
https://github.com/MariaDB/server/commit/99cbe03a44cc95e6f548550df51e7201ebe...
If you have a better solution, please advise.
R: Jan
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hi Sergei, This does not seem to work. Consider following: CREATE TABLE t1 (id INT PRIMARY KEY) ENGINE=InnoDB; INSERT INTO t1 VALUES (1); connection node_2; SET AUTOCOMMIT=OFF; START TRANSACTION; INSERT INTO t1 VALUES (2); connection node_2a; ALTER TABLE t1 ADD COLUMN f2 INTEGER, LOCK=EXCLUSIVE; Problem seems to be the fact that the MDL-lock acquired by thread executing INSERT that thread executing ALTER wants to be released by killing holder is not released. We do kill query inside InnoDB. This MDL-code is not familiar to me and I do not yet understand why MDL-lock is not released R: Jan On Thu, Oct 14, 2021 at 9:49 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Jan!
Here's an idea of the fix:
Let's always use the KILL mutex locking order, that is
victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex
For this we need to fix wsrep_abort_transaction(), which is called from the server, and wsrep_innobase_kill_one_trx(), which is called from BF thread.
wsrep_abort_transaction() can be fixed by not invoking wsrep_innobase_kill_one_trx() and always using KILL code path (that is wsrep_thd_awake) and forcing rollback after the kill.
wsrep_innobase_kill_one_trx() can be fixed by not locking LOCK_thd_data at all, just don't lock it. We know that the victim waits on a lock inside InnoDB and we've locked trx mutex and lock_sys mutex. The victim cannot go away, cannot modify its data, it cannot do anything. So, LOCK_thd_data doesn't seem to be necessary at that point.
I've attached a demo patch. It compiles, but I didn't try to run it, it's only to show the idea, not a working fix (I already suspect I removed too much from wsrep_abort_transaction()). Note it's the patch for 10.2 at the commit 29bbcac0ee8^ - that is one commit before my fix.
On Oct 12, Jan Lindström wrote:
Hi Sergei,
Update on wsrep_close_connections problem. My suggestion to fix this issue is on
https://github.com/MariaDB/server/commit/99cbe03a44cc95e6f548550df51e7201ebe...
If you have a better solution, please advise.
R: Jan
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
Hi Sergei, Your suggestion does not work. There are more than one problem (1) wsrep_abort_transaction does not release MDL-lock (2) innobase_kill_one_trx crashes at wsrep->abort_pre_commit() because transaction registered inside wsrep has disappeared (this does not happen if THD::LOCK_thd_data is locked) ==> I will use Seppo's KILL as TOI and remove all unrelated changes from mdl.cc and wsrep_close_connections, they are not related to problems we need to fix. Let's take one problem at a time. TOI is a very powerful solution to the problem we are trying to fix. R: Jan On Thu, Oct 21, 2021 at 7:52 AM Jan Lindström <jan.lindstrom@mariadb.com> wrote:
Hi Sergei,
This does not seem to work. Consider following:
CREATE TABLE t1 (id INT PRIMARY KEY) ENGINE=InnoDB; INSERT INTO t1 VALUES (1); connection node_2; SET AUTOCOMMIT=OFF; START TRANSACTION; INSERT INTO t1 VALUES (2); connection node_2a; ALTER TABLE t1 ADD COLUMN f2 INTEGER, LOCK=EXCLUSIVE;
Problem seems to be the fact that the MDL-lock acquired by thread executing INSERT that thread executing ALTER wants to be released by killing holder is not released. We do kill query inside InnoDB. This MDL-code is not familiar to me and I do not yet understand why MDL-lock is not released
R: Jan
On Thu, Oct 14, 2021 at 9:49 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Jan!
Here's an idea of the fix:
Let's always use the KILL mutex locking order, that is
victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex
For this we need to fix wsrep_abort_transaction(), which is called from the server, and wsrep_innobase_kill_one_trx(), which is called from BF thread.
wsrep_abort_transaction() can be fixed by not invoking wsrep_innobase_kill_one_trx() and always using KILL code path (that is wsrep_thd_awake) and forcing rollback after the kill.
wsrep_innobase_kill_one_trx() can be fixed by not locking LOCK_thd_data at all, just don't lock it. We know that the victim waits on a lock inside InnoDB and we've locked trx mutex and lock_sys mutex. The victim cannot go away, cannot modify its data, it cannot do anything. So, LOCK_thd_data doesn't seem to be necessary at that point.
I've attached a demo patch. It compiles, but I didn't try to run it, it's only to show the idea, not a working fix (I already suspect I removed too much from wsrep_abort_transaction()). Note it's the patch for 10.2 at the commit 29bbcac0ee8^ - that is one commit before my fix.
On Oct 12, Jan Lindström wrote:
Hi Sergei,
Update on wsrep_close_connections problem. My suggestion to fix this issue is on
https://github.com/MariaDB/server/commit/99cbe03a44cc95e6f548550df51e7201ebe...
If you have a better solution, please advise.
R: Jan
Regards, Sergei VP of MariaDB Server Engineering and security@mariadb.org
participants (2)
-
Jan Lindström
-
Sergei Golubchik