Hi, Jan!

Here's an idea for the fix: let's always use the KILL mutex locking order, that is

  victim_thread->LOCK_thd_data -> lock_sys->mutex -> victim_trx->mutex

For this we need to fix wsrep_abort_transaction(), which is called from the server, and wsrep_innobase_kill_one_trx(), which is called from the BF thread.

wsrep_abort_transaction() can be fixed by not invoking wsrep_innobase_kill_one_trx() and always using the KILL code path (that is, wsrep_thd_awake) and forcing a rollback after the kill.

wsrep_innobase_kill_one_trx() can be fixed by not locking LOCK_thd_data at all. We know that the victim waits on a lock inside InnoDB, and we've locked the trx mutex and the lock_sys mutex. The victim cannot go away, cannot modify its data, cannot do anything. So LOCK_thd_data doesn't seem to be necessary at that point.

I've attached a demo patch. It compiles, but I didn't try to run it; it's only to show the idea, not a working fix (I already suspect I removed too much from wsrep_abort_transaction()). Note that it's a patch for 10.2 at the commit 29bbcac0ee8^, that is, one commit before my fix.

On Oct 12, Jan Lindström wrote:
Hi Sergei,
Update on wsrep_close_connections problem. My suggestion to fix this issue is on https://github.com/MariaDB/server/commit/99cbe03a44cc95e6f548550df51e7201ebe...
If you have a better solution, please advise.
R: Jan
Regards,
Sergei
VP of MariaDB Server Engineering
and security@mariadb.org
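
For illustration only, the locking order proposed in the reply above can be sketched as a small standalone C++ program. This is not MariaDB code and not the attached patch: the Victim struct, lock_sys_mutex, kill_path(), bf_path() and run_demo() are hypothetical stand-ins, and std::mutex replaces the server's own mutex types. The sketch assumes, as the reply does, that the BF path can safely skip LOCK_thd_data; it only shows that when one path takes all three mutexes in the agreed order and the other takes the last two in the same relative order, the two paths cannot deadlock against each other.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for the real objects; none of this is MariaDB code.
struct Victim {
    std::mutex lock_thd_data;  // stands in for victim_thread->LOCK_thd_data
    std::mutex trx_mutex;      // stands in for victim_trx->mutex
    int aborts = 0;            // protected by trx_mutex
};

std::mutex lock_sys_mutex;     // stands in for lock_sys->mutex

// KILL path: takes all three mutexes in the agreed order
// LOCK_thd_data -> lock_sys mutex -> trx mutex.
void kill_path(Victim& v) {
    std::lock_guard<std::mutex> thd(v.lock_thd_data);
    std::lock_guard<std::mutex> sys(lock_sys_mutex);
    std::lock_guard<std::mutex> trx(v.trx_mutex);
    ++v.aborts;
}

// BF path: never touches LOCK_thd_data and takes the remaining two
// mutexes in the same relative order, so no lock-order inversion
// against kill_path() is possible.
void bf_path(Victim& v) {
    std::lock_guard<std::mutex> sys(lock_sys_mutex);
    std::lock_guard<std::mutex> trx(v.trx_mutex);
    ++v.aborts;
}

// Runs both paths concurrently and returns the total number of
// simulated aborts once both threads have finished.
int run_demo(int iterations) {
    Victim v;
    std::thread killer([&] {
        for (int i = 0; i < iterations; ++i) kill_path(v);
    });
    std::thread bf([&] {
        for (int i = 0; i < iterations; ++i) bf_path(v);
    });
    killer.join();
    bf.join();
    assert(v.aborts == 2 * iterations);  // both threads always make progress
    return v.aborts;
}
```

Calling run_demo(1000) from a main() exercises both paths against each other. If bf_path() instead took trx_mutex first and LOCK_thd_data afterwards, the same loop could hang, which is the kind of inversion the proposal is meant to rule out.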