Hi Jan,

Ramesh said that he has not observed any "WSREP: BF lock wait long for trx" messages while running tests. This would suggest that the diagnostic output code is essentially untested.

I would say that the diagnostic output is related to MDEV-23328 or MDEV-25114, because it involves the same mutexes as the MDEV-23328 hang. Furthermore, the MDEV-23328 scenario is the forced abort of lock-holding lower-priority transactions due to applying certified transactions (called "BF" or "brute force" in Galera).

I am glad that you will reconsider my request to remove wsrep_trx_print_locking() and the mutex operations around the call.

Marko

On Mon, Oct 25, 2021 at 9:04 AM Jan Lindström <jan.lindstrom@mariadb.com> wrote:
Hi Marko,
I am sad to see that my comment regarding wsrep_is_BF_lock_timeout() that I made in https://github.com/MariaDB/server/commit/b74b53f0515b360bb5cddec1a506a2f4d4d... (June 17) has not been addressed. Do we really need that output? Do we see that output in our internal testing? If not, then we have not tested that that code is free from race conditions or hangs. (It should be a lot safer to avoid such unnecessary unlock/lock exercises involving multiple mutexes.) If yes, then why have we not added source code comments to document when such scenarios would occur? I believe that it is better to rely on some operating system features (such as stack traces from a debugger) rather than to try to implement partial logging.
I know; this is not related to MDEV-23328 or MDEV-25114, but OK, I can remove most of this stupid code.
R: Jan
--
Marko Mäkelä, Lead Developer InnoDB
MariaDB Corporation