Hi,

Few questions:

(1) Is this review for a full patch or just problems on wsrep_abort_transaction ?
(2) In case at wsrep_abort_transaction we do not have a transaction idea is that we do not anymore want to enter InnoDB i.e. innobase_kill_query,
that is the reason we set MUST_ABORT to wsrep_conflict_state so that no more mutexes from InnoDB is acquired.
(3) In wsrep_close_connections code used LOCK_thread_lock, is that enough to protect THDs on list from concurrent disconnect and delete? (I added locking
for THD::LOCK_thd_data but I was not sure is it necessary)
(4) Do we still keep manual KILL as TOI ?
(5) What I do with thr_lock.c there we pass victim THD pointer to wsrep_abort_thd and it is as you pointed out mostly unprotected on that code,
LOCK TABLE is TOI 
(6) Please note that fix will not be ready 19th of October as we again change critical path, Ramesh will run QA with pquery

On Thu, Oct 14, 2021 at 9:49 PM Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Jan!

Here's an idea of the fix:

Let's always use the KILL mutex locking order, that is

  victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex

For this we need to fix wsrep_abort_transaction(), which is called from the
server, and wsrep_innobase_kill_one_trx(), which is called from BF
thread.

wsrep_abort_transaction() can be fixed by not invoking
wsrep_innobase_kill_one_trx() and always using KILL code path (that is
wsrep_thd_awake) and forcing rollback after the kill.

wsrep_innobase_kill_one_trx() can be fixed by not locking LOCK_thd_data
at all, just don't lock it. We know that the victim waits on a lock
inside InnoDB and we've locked trx mutex and lock_sys mutex. The victim
cannot go away, cannot modify its data, it cannot do anything. So,
LOCK_thd_data doesn't seem to be necessary at that point.

I've attached a demo patch. It compiles, but I didn't try to run it,
it's only to show the idea, not a working fix (I already suspect I
removed too much from wsrep_abort_transaction()). Note it's the patch
for 10.2 at the commit 29bbcac0ee8^ - that is one commit before my fix.

On Oct 12, Jan Lindström wrote:
> Hi Sergei,
>
> Update on wsrep_close_connections problem. My suggestion to fix this issue
> is on
> https://github.com/MariaDB/server/commit/99cbe03a44cc95e6f548550df51e7201ebea3b9d
>
> If you have a better solution, please advise.
>
> R: Jan

Regards,
Sergei
VP of MariaDB Server Engineering
and security@mariadb.org