Re: [Maria-developers] 4b164f176e6: MDEV-25114 Crash: WSREP: invalid state ROLLED_BACK (FATAL)

6 Oct 2021


      Hi Sergei,

Answers below:
...
...
...
...
+/* This is wrapper for wsrep_break_lock in thr_lock.c */
+static int wsrep_thr_abort_thd(void *bf_thd_ptr, void
*victim_thd_ptr, my_bool signal)
+{
+  THD* victim_thd= (THD *) victim_thd_ptr;
+  /* We need to lock THD::LOCK_thd_data to protect victim
+  from concurrent usage or disconnect or delete. */
How do you know victim_thd wasn't deleted before you locked
LOCK_thd_data below?
I must say the thr_lock code is not familiar to me but there are
mysql_mutex_lock() calls to lock->mutex. After code review  it is not
clear to me what that mutex is.
where are mysql_mutex_lock() calls to lock->mutex?
mysys/thr_lock.c there is function thr_lock() there is call to
mysql_mutex_lock(&lock->mutex); this is before wsrep_break_lock where we
call wsrep_thd_abort
...
...
...
...
if (victim_trx) {
+                wsrep_thd_UNLOCK(victim_thd);
what keeps victim_trx from disappearing here?
Nothing. Do you have suggestions ?
A couple of thoughts:
* Why do you unlock LOCK_thd_data here at all? I believed the whole
  point of using TOI was to make sure that even if you lock mutexes in a
  different order, it will not cause a deadlock.
This is DDL-case when we have a MDL-conflict not user KILL, we need to
release it in my opinion because
we need to take mutexes in lock_sys -> trx -> THD::LOCK_thd_data order, if
I do not release I can see easily
safe_mutex warnings
...
* You cannot pass a THD safely over a gap where no locks keep it in
  existence. You can pass an integer, thread id, and use
  find_thread_by_id later.
I have a suggestion for this:

         if (victim_trx) {
+
+               const trx_id_t victim_trx_id= victim_trx->id;
+               const longlong victim_thread= thd_get_thread_id(victim_thd);
                lock_mutex_enter();
-               trx_mutex_enter(victim_trx);
-               wsrep_innobase_kill_one_trx(bf_thd, bf_trx, victim_trx,
signal);
+               if (trx_t* victim= trx_rw_is_active(victim_trx_id, NULL,
true)) // If this succeeds, trx can not go away
+               {
+                       trx_mutex_enter(victim);
+                       ut_ad(victim->mysql_thd ?
+                               thd_get_thread_id(victim->mysql_thd) ==
victim_thread :
+                               1);
+                       wsrep_innobase_kill_one_trx(bf_thd, bf_trx, victim,
signal);
+                       trx_mutex_exit(victim);
+                       victim->release_reference(); // Now trx can go away
after we have released lock_sys
+                       wsrep_srv_conc_cancel_wait(victim);
+               }
                lock_mutex_exit();
-               trx_mutex_exit(victim_trx);
-               wsrep_srv_conc_cancel_wait(victim_trx);
                DBUG_VOID_RETURN;
...
By the way, how long can WSREP_TO_ISOLATION_BEGIN() take?
Does it need to wait for everything else being replicated?
As long as it takes to start it on all nodes in the cluster.

R: Jan

Re: [Maria-developers] 4b164f176e6: MDEV-25114 Crash: WSREP: invalid state ROLLED_BACK (FATAL)

Jan Lindström