Locking thd->LOCK_thd_data during group commit
Hi Serg,

I came upon this commit by you:
commit 6b685ea7b0776430d45b095cb4be3ef0739a3c04
Author: Sergei Golubchik <serg@mariadb.org>
Date:   Wed Sep 28 18:55:15 2022 +0200
correctness assert
thd_get_ha_data() can be used without a lock, but only from the current thd thread, when calling from another thread it *must* be protected by thd->LOCK_thd_data
* fix group commit code to take thd->LOCK_thd_data
* remove innobase_close_connection() from the innodb background thread,
  it's not needed after 87775402cd0c and was failing the assert with
  current_thd==0
@@ -8512,7 +8512,11 @@ MYSQL_BIN_LOG::trx_group_commit_leader(group_commit_entry *leader)
       ++num_commits;
       if (current->cache_mngr->using_xa && likely(!current->error) &&
           DBUG_EVALUATE_IF("skip_commit_ordered", 0, 1))
+      {
+        mysql_mutex_lock(&current->thd->LOCK_thd_data);
         run_commit_ordered(current->thd, current->all);
+        mysql_mutex_unlock(&current->thd->LOCK_thd_data);
+      }
This seems _really_ expensive :-( This code runs for every transaction commit
in replication, and it runs under the global LOCK_commit_ordered, which needs
to be held for as short a time as possible.

The commit message doesn't mention anything about what goes wrong during the
group commit code. And the patch doesn't have any test case that shows the
problem, it just adds an assertion that will trigger during the group commit.

So what is the actual reason for this change?

Note that at this point, the group commit leader thread _is_ effectively
acting as if it was each of the participating threads, the other threads are
blocked from running during commit_ordered.

If it is just so that we can have this assertion, then that needs to be
rolled back. We _really_ don't want to take/release N mutexes in release
builds during LOCK_commit_ordered, just to have an assertion in debug builds.

If there is some reason, then the correct fix could be to set current_thd for
the duration of run_commit_ordered, which will satisfy the assertion and
presumably the actual bug? Well, it depends on _what_ the actual reason/bug
is.

 - Kristian.
Kristian Nielsen via developers <developers@lists.mariadb.org> writes:
If it is just so that we can have this assertion, then that needs to be rolled back. We _really_ don't want to take/release N mutexes in release builds during LOCK_commit_ordered, just to have an assertion in debug builds.
I mean, it's fine to have the assertion to catch bugs, we just need to find
some way to do it without the cost of a mutex take/release under
LOCK_commit_ordered.

 - Kristian.
Hi, Kristian,

On Oct 20, Kristian Nielsen wrote:
Hi Serg,
I came upon this commit by you:
commit 6b685ea7b0776430d45b095cb4be3ef0739a3c04
Author: Sergei Golubchik <serg@mariadb.org>
Date:   Wed Sep 28 18:55:15 2022 +0200
correctness assert
thd_get_ha_data() can be used without a lock, but only from the current thd thread, when calling from another thread it *must* be protected by thd->LOCK_thd_data
* fix group commit code to take thd->LOCK_thd_data
* remove innobase_close_connection() from the innodb background thread,
  it's not needed after 87775402cd0c and was failing the assert with
  current_thd==0
@@ -8512,7 +8512,11 @@ MYSQL_BIN_LOG::trx_group_commit_leader(group_commit_entry *leader)
       ++num_commits;
       if (current->cache_mngr->using_xa && likely(!current->error) &&
           DBUG_EVALUATE_IF("skip_commit_ordered", 0, 1))
+      {
+        mysql_mutex_lock(&current->thd->LOCK_thd_data);
         run_commit_ordered(current->thd, current->all);
+        mysql_mutex_unlock(&current->thd->LOCK_thd_data);
+      }
This seems _really_ expensive :-( This code runs for every transaction commit
in replication, and it runs under the global LOCK_commit_ordered, which needs
to be held for as short a time as possible.
Monty asserted many times that taking an uncontended mutex is cheap. Did you think it's expensive or did you benchmark that?
So what is the actual reason for this change?
This is just a protocol for accessing a mostly-read-only thread-local
variable. One option would be to protect all accesses with a mutex, which
would heavily penalize the thread that owns the variable. This commit
enforced a different protocol: the owner can read the value without a mutex,
and only takes the mutex when modifying it (which is infrequent). A different
thread can only read the variable, and only with the mutex held (this is also
very infrequent).

I don't remember what code was violating it, likely something wsrep-specific.
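To illustrate the protocol (a standalone sketch only -- std::mutex instead of
the server's mysql_mutex_t, and the struct/field names here are made up, not
the actual THD members):

  #include <mutex>

  struct Session
  {
    std::mutex LOCK_data;    // plays the role of thd->LOCK_thd_data
    void *ha_data= nullptr;  // mostly read-only, owned by one thread
  };

  // Owner thread: may read without the mutex ...
  void *owner_read(Session &s) { return s.ha_data; }

  // ... but must take the mutex to modify (infrequent).
  void owner_write(Session &s, void *v)
  {
    std::lock_guard<std::mutex> g(s.LOCK_data);
    s.ha_data= v;
  }

  // Any other thread: read-only, and only under the mutex (also infrequent).
  void *other_thread_read(Session &s)
  {
    std::lock_guard<std::mutex> g(s.LOCK_data);
    return s.ha_data;
  }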
Note that at this point, the group commit leader thread _is_ effectively acting as if it was each of the participating threads, the other threads are blocked from running during commit_ordered.
I'm not sure it's good enough. Practically it might work, but if one thread has modified some variable under a mutex, you cannot simply read it from another thread without a mutex, there's no guarantee that another thread will see the new value. You'll need an appropriate memory barrier for that.
If there is some reason, then the correct fix could be to set current_thd for the duration of run_commit_ordered, which will satisfy the assertion and presumably the actual bug?
Setting current_thd is cheating, it will trick the assert, but won't fix the
underlying problem as outlined above.

Regards,
Sergei
Chief Architect, MariaDB Server
and security@mariadb.org
Sergei Golubchik <serg@mariadb.org> writes:
Note that at this point, the group commit leader thread _is_ effectively acting as if it was each of the participating threads, the other threads are blocked from running during commit_ordered.
I'm not sure it's good enough. Practically it might work, but if one thread has modified some variable under a mutex, you cannot simply read it from another thread without a mutex, there's no guarantee that another thread will see the new value. You'll need an appropriate memory barrier for that.
Sure, I understand that, maybe I should have explained this code in more
detail. This is the binlog group commit. We have N transactions doing commit
at the same time: T1, T2, ..., TN. After writing to the binlog, we must run
commit_ordered(T_i) for each transaction in order, i=1, 2, ..., N.

The wait_for_prior_commit mechanism exists for this. Each thread will do:

  wait_for_prior_commit(T_{i-1})
  commit_ordered(T_i)
  wakeup_subsequent_commits()

But this is slow, since it requires thread scheduling between N threads,
in-order:

  https://knielsen-hq.org/w/benchmarking-thread-scheduling-in-group-commit/

Instead, what we do is that T_1 will call commit_ordered() for _all_ the N
transactions. Meanwhile the other threads are waiting to be signalled.

Note that LOCK_wait_commit protects the hand-over from thread T_i to T_1 and
back, so we _do_ have proper mutex protection. This mutex provides the
necessary memory barrier between T_i modifying data and T_1 reading it, or
T_1 modifying and T_i reading.

Does that make it clear?
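In very rough pseudo-C++, the leader pattern looks like this (a standalone
sketch only; THD, the queue handling and commit_ordered() are simplified
stand-ins, not the real code in MYSQL_BIN_LOG::trx_group_commit_leader()):

  #include <mutex>

  struct THD { /* ... */ };
  struct group_commit_entry { THD *thd; bool all; group_commit_entry *next; };

  std::mutex LOCK_commit_ordered;          // models the global mutex

  void commit_ordered(THD *, bool) { /* engine commit_ordered() step */ }

  // T_1 (the leader) runs commit_ordered() for every queued transaction, in
  // binlog order, while T_2..T_N sit blocked waiting to be signalled. The
  // hand-over of each entry between T_i and T_1 happens under the entry
  // owner's LOCK_wait_commit, which is what provides the memory barrier.
  void leader_commit_ordered(group_commit_entry *queue)
  {
    std::lock_guard<std::mutex> g(LOCK_commit_ordered);
    for (group_commit_entry *e= queue; e; e= e->next)
      commit_ordered(e->thd, e->all);
  }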
If there is some reason, then the correct fix could be to set current_thd for the duration of run_commit_ordered, which will satisfy the assertion and presumably the actual bug?
Setting current_thd is cheating, it will trick the assert, but won't fix the underlying problem as outlined above.
Nono, it's not cheating, it's in fact setting the correct value. While
running commit_ordered(T_i), the current THD _is_ the THD for T_i, not the
THD for T_1, which is unrelated to transaction T_i.

It's already done for a similar case a bit earlier in the code, where T_1 is
writing the binlog for each of the other threads:

  set_current_thd(current->thd);

It's similar to the pool-of-threads. If a THD goes idle for some time, and
then is resumed on a different thread in the thread pool, this does not mean
that every access to the data must now be done under LOCK_thd_data. Rather,
now the THD is owned by another thread, so this thread sets current_thd
accordingly. The group commit code is similar, the THD is (temporarily) moved
to be under a different thread.

Does that make sense?
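So the hunk quoted above might instead become something like this (just a
sketch of the idea, not a tested patch; presumably the leader's own THD would
also have to be restored afterwards, as the nearby binlog-writing code does):

  if (current->cache_mngr->using_xa && likely(!current->error) &&
      DBUG_EVALUATE_IF("skip_commit_ordered", 0, 1))
  {
    set_current_thd(current->thd);   /* T_i's THD, not the leader's */
    run_commit_ordered(current->thd, current->all);
  }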
Monty asserted many times that taking an uncontended mutex is cheap. Did you think it's expensive or did you benchmark that?
I did, actually! (in the past). Taking a mutex needs to reserve the
associated cache line, which is not free. "Expensive" is relative, of course.
But here, LOCK_thd_data serves no purpose, and it's done under a very
critical global lock, LOCK_commit_ordered. So surely we should not be
taking/releasing N extra mutexes here without any reason to do so?

For the same reason, set_current_thd() was not done in this code, to avoid it
if it serves no purpose. This is documented on the commit_ordered method in
sql/handler.h:

  "Note that commit_ordered() can be called from a different thread than the
  one handling the transaction! So it can not do anything that depends on
  thread local storage"

That is why I asked if you knew what the real problem is that triggered this
assertion. There is no reason to add code in a performance-critical place if
it serves no purpose. Worse, it might hide the real bug if some code was
added to commit_ordered that violates the above requirement.

I think it is fine to add set_current_thd() *if it is actually needed*. But
we should understand first *why* it is needed, otherwise we may hide the real
bug.

 - Kristian.
Hi, Kristian,

On Oct 20, Kristian Nielsen wrote:
Sergei Golubchik <serg@mariadb.org> writes:
If there is some reason, then the correct fix could be to set current_thd for the duration of run_commit_ordered, which will satisfy the assertion and presumably the actual bug?
Setting current_thd is cheating, it will trick the assert, but won't fix the underlying problem as outlined above.
Nono, it's not cheating, it's in fact setting the correct value. While running commit_ordered(T_i), the current THD _is_ the THD for T_i, not the THD for T_1, which is unrelated to transaction T_i.
okay. but practically, is it cheaper than lock/unlock of an uncontended
mutex?

UPD: just looked at the assembly, and indeed, it is very cheap. It used to be
expensive, but that was fixed years ago. Let's do it.

Also, it made me reconsider the old rule of "avoid current_thd, it's
expensive, pass THD as an argument". What's there to avoid? It's just

  Dump of assembler code for function _current_thd():
     0x000055780130bb21 <+0>:     push   %rbp
     0x000055780130bb22 <+1>:     mov    %rsp,%rbp
     0x000055780130bb25 <+4>:     mov    %fs:0x0,%rax
     0x000055780130bb2e <+13>:    lea    -0x1c8(%rax),%rax
     0x000055780130bb35 <+20>:    mov    (%rax),%rax
     0x000055780130bb38 <+23>:    pop    %rbp
     0x000055780130bb39 <+24>:    ret

(in a debug build, I don't have an optimized build handy)
Monty asserted many times that taking an uncontended mutex is cheap. Did you think it's expensive or did you benchmark that?
I did, actually! (in the past). Taking a mutex needs to reserve the associated cache line, which is not free. "Expensive" is relative, of course.
I'm sure it's an implementation detail. But I've just looked on my laptop and
the mutex lock was

  lock cmpxchg

so, it's a bus lock, indeed, not cheap.
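(Side note: the uncontended cost is easy to measure directly. A rough
standalone micro-benchmark sketch -- single-threaded, so the lock is never
contended and its cache line stays hot, which makes this a best case:)

  #include <chrono>
  #include <cstdio>
  #include <mutex>

  int main()
  {
    std::mutex m;
    const long N= 100000000;
    auto start= std::chrono::steady_clock::now();
    for (long i= 0; i < N; i++)
    {
      m.lock();    // fast path: one atomic RMW (e.g. lock cmpxchg), no syscall
      m.unlock();  // fast path: one atomic op, futex wake only if waiters exist
    }
    auto ns= std::chrono::duration_cast<std::chrono::nanoseconds>(
               std::chrono::steady_clock::now() - start).count();
    std::printf("%.2f ns per uncontended lock/unlock pair\n", (double)ns / N);
    return 0;
  }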
For the same reason, set_current_thd() was not done in this code, to avoid it if it serves no purpose.
Indeed.
I think it is fine to add set_current_thd() *if it is actually needed*. But we should understand first *why* it is needed, otherwise we may hide the real bug.
That's as I said, the assert simply verifies that the protocol is properly
followed. So, set_current_thd() isn't really needed beyond that. But as it's
cheap, let's do it.

Regards,
Sergei
Chief Architect, MariaDB Server
and security@mariadb.org
Hi Kristian and Sergei! I will walk through the code later. For now I have a
few comments on Sergei's last reply:

On Fri, 20 Oct 2023 at 22:35, Sergei Golubchik via developers
<developers@lists.mariadb.org> wrote:
Also, it made me reconsider the old rule of "avoid current_thd, it's
expensive, pass THD as an argument". What's there to avoid? It's just
Dump of assembler code for function _current_thd():
   0x000055780130bb21 <+0>:     push   %rbp
   0x000055780130bb22 <+1>:     mov    %rsp,%rbp
   0x000055780130bb25 <+4>:     mov    %fs:0x0,%rax
   0x000055780130bb2e <+13>:    lea    -0x1c8(%rax),%rax
   0x000055780130bb35 <+20>:    mov    (%rax),%rax
   0x000055780130bb38 <+23>:    pop    %rbp
   0x000055780130bb39 <+24>:    ret
(in a debug build, I don't have an optimized build handy)
Yes, Eugene Kosov changed the implementation to use C++11's thread_local
instead of pthread_getspecific() some years ago. So now it costs about two
memory accesses. If accessed from a shared object, it can be one more
indirection in the ELF TLS model, according to how I understood the ABI; see
the attached.
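Conceptually it is now just something like the following (a simplified
standalone sketch, not the actual _current_thd()/set_current_thd()
implementation in the server):

  struct THD;  // opaque here

  // One slot per thread: C++11 thread_local instead of pthread_getspecific().
  static thread_local THD *current_thd_ptr= nullptr;

  THD *_current_thd()
  {
    // Compiles to a TLS load, roughly the "mov %fs:...,%rax" seen in the
    // dump above; from a shared object it may instead go via __tls_get_addr.
    return current_thd_ptr;
  }

  void set_current_thd(THD *thd)
  {
    current_thd_ptr= thd;
  }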
Monty asserted many times that taking an uncontended mutex is cheap.
Did you think it's expensive or did you benchmark that?
I did, actually! (in the past). Taking a mutex needs to reserve the associated cache line, which is not free. "Expensive" is relative, of course.
I'm sure it's an implementation detail. But I've just looked on my laptop and the mutex lock was
lock cmpxchg
so, it's a bus lock, indeed, not cheap.
It's never a bus lock on the modern implementations, it rather specifies the
operation to execute atomically. Quoting the Intel Software Developer's
Manual [link
<https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-260.html>]:
Locked operations are atomic with respect to all other memory operations and all externally visible events.
Normally it will be synchronized through MESI, unless the memory is not
cache-line-aligned or is marked as non-cacheable:
https://stackoverflow.com/a/3339380

The price is still notable, though. Agner Fog
<https://www.agner.org/optimize/instruction_tables.pdf> measures a lock
cmpxchg latency of 20 CPU cycles on Haswell. I did not find a direct answer
for whether that is a contended case or not; I guess it's all measured in a
single thread, so it should be the uncontended case. Also see
https://www.uops.info/html-instr/CMPXCHG_LOCK_M64_R64.html with similar
results.

--
Yours truly,
Nikita Malyavin
Hi, Nikita,

On Oct 21, Nikita Malyavin wrote:
Did you think it's expensive or did you benchmark that? I did, actually! (in the past). Taking a mutex needs to reserve the associated cache line, which is not free. "Expensive" is relative, of course.
I'm sure it's an implementation detail. But I've just looked on my laptop and the mutex lock was
lock cmpxchg
so, it's a bus lock, indeed, not cheap.
It's never a bus lock on the modern implementations, it rather specifies the operation to execute atomically.
  Causes the processor’s LOCK# signal to be asserted during execution of the
  accompanying instruction (turns the instruction into an atomic
  instruction). In a multiprocessor environment, the LOCK# signal insures
  that the processor has exclusive use of any shared memory while the signal
  is asserted.

This is from the "IA-32 Intel® Architecture Software Developer’s Manual
Volume 2A: Instruction Set Reference, A-M", that can be downloaded from
intel.com.

Okay, it doesn't say "a bus lock" explicitly anymore.
Normally it will be synchronized through MESI, unless the memory is not cache-line-aligned, or is marked as non-cacheable.
These are two completely different issues: how changes are synchronized and
how atomicity is implemented.

Regards,
Sergei
Chief Architect, MariaDB Server
and security@mariadb.org
Sergei,

On Sat, 21 Oct 2023 at 14:56, Sergei Golubchik <serg@mariadb.org> wrote:
Hi, Nikita,
On Oct 21, Nikita Malyavin wrote:
Did you think it's expensive or did you benchmark that? I did, actually! (in the past). Taking a mutex needs to reserve the associated cache line, which is not free. "Expensive" is relative, of course.
I'm sure it's an implementation detail. But I've just looked on my laptop and the mutex lock was
lock cmpxchg
so, it's a bus lock, indeed, not cheap.
It's never a bus lock on the modern implementations, it rather specifies the operation to execute atomically.
Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multiprocessor environment, the LOCK# signal insures that the processor has exclusive use of any shared memory while the signal is asserted.
This is from the "IA-32 Intel® Architecture Software Developer’s Manual Volume 2A: Instruction Set Reference, A-M", that can be downloaded from intel.com.
Volume 3A: System Programming Guide, Part 1 <https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf>
8.1.4 Effects of a LOCK Operation on Internal Processor Caches

For the Intel486 and Pentium processors, the LOCK# signal is always asserted
on the bus during a LOCK operation, even if the area of memory being locked
is cached in the processor.

For the P6 and more recent processor families, if the area of memory being
locked during a LOCK operation is cached in the processor that is performing
the LOCK operation as write-back memory and is completely contained in a
cache line, the processor may not assert the LOCK# signal on the bus.
Instead, it will modify the memory location internally and allow its cache
coherency mechanism to ensure that the operation is carried out atomically.
This operation is called “cache locking.” The cache coherency mechanism
automatically prevents two or more processors that have cached the same area
of memory from simultaneously modifying data in that area.
--
Yours truly,
Nikita Malyavin