Hi Kristian and Sergei!

I will walk through the code later. For now, I have a few comments on Sergei's last reply:

On Fri, 20 Oct 2023 at 22:35, Sergei Golubchik via developers <developers@lists.mariadb.org> wrote:
Also, it made me reconsider the old rule of "avoid current_thd, it's expensive, pass THD as an argument". What is there to avoid? It's just
Dump of assembler code for function _current_thd():
   0x000055780130bb21 <+0>:     push   %rbp
   0x000055780130bb22 <+1>:     mov    %rsp,%rbp
   0x000055780130bb25 <+4>:     mov    %fs:0x0,%rax
   0x000055780130bb2e <+13>:    lea    -0x1c8(%rax),%rax
   0x000055780130bb35 <+20>:    mov    (%rax),%rax
   0x000055780130bb38 <+23>:    pop    %rbp
   0x000055780130bb39 <+24>:    ret
(in a debug build, I don't have an optimized build handy)
Yes, Eugene Kosov changed the implementation some years ago to use C++11's thread_local instead of pthread_getspecific. So now it costs about two memory accesses. When accessed from a shared object, it can take one more indirection in ELF, as far as I understand the ABI; see the attached.
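To make the thread_local point concrete, here is a minimal sketch. It is not the actual server code: THD is a stand-in struct, and current_thd_ptr, get_current_thd(), set_current_thd() and thread_local_is_per_thread() are hypothetical names picked for illustration. It only shows that a thread_local pointer gives each thread its own slot, read with a plain load rather than a pthread_getspecific() call.

```cpp
#include <cassert>
#include <thread>

// Hypothetical stand-in for the server's THD class.
struct THD { int id; };

// One pointer per thread (hypothetical name, not the server's variable).
static thread_local THD *current_thd_ptr = nullptr;

inline THD *get_current_thd() { return current_thd_ptr; }
inline void set_current_thd(THD *thd) { current_thd_ptr = thd; }

// Returns true if the main thread and a worker thread each see
// only the pointer they stored themselves.
bool thread_local_is_per_thread()
{
  THD main_thd{1};
  set_current_thd(&main_thd);

  bool worker_ok = false;
  std::thread worker([&worker_ok] {
    // A new thread starts with its own, still empty, slot.
    THD *before = get_current_thd();
    THD worker_thd{2};
    set_current_thd(&worker_thd);
    worker_ok = (before == nullptr) && (get_current_thd()->id == 2);
    set_current_thd(nullptr);
  });
  worker.join();

  // The worker's stores did not touch the main thread's slot.
  bool main_ok = (get_current_thd() == &main_thd);
  assert(main_ok);
  set_current_thd(nullptr);
  return worker_ok && main_ok;
}
```

How many instructions the accessor compiles to depends on the TLS model the compiler picks (executable vs. shared object), so it is worth checking the disassembly of an optimized build rather than relying on this sketch.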
Monty asserted many times that taking an uncontended mutex is cheap.
Did you think it's expensive or did you benchmark that?
I did, actually! (in the past). Taking a mutex needs to reserve the associated cache line, which is not free. "Expensive" is relative, of course.
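For anyone who wants to repeat such a measurement, a minimal single-threaded sketch follows. It is not the benchmark mentioned above: BenchResult and run_uncontended() are hypothetical names, and a loop this naive only gives a rough per-iteration average that includes loop overhead.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>

struct BenchResult {
  std::uint64_t counter;  // number of increments performed
  double ns_per_iter;     // average wall-clock nanoseconds per iteration
};

// Increments a counter `iterations` times. When `use_mutex` is true,
// each increment is wrapped in a lock/unlock of a mutex that no other
// thread ever touches, i.e. the uncontended case.
BenchResult run_uncontended(std::uint64_t iterations, bool use_mutex)
{
  std::mutex m;
  // volatile keeps the compiler from collapsing the loop into one add.
  volatile std::uint64_t counter = 0;

  auto start = std::chrono::steady_clock::now();
  for (std::uint64_t i = 0; i < iterations; i++) {
    if (use_mutex) {
      std::lock_guard<std::mutex> guard(m);
      counter = counter + 1;
    } else {
      counter = counter + 1;
    }
  }
  auto stop = std::chrono::steady_clock::now();

  assert(counter == iterations);
  double ns = std::chrono::duration<double, std::nano>(stop - start).count();
  return {counter, iterations ? ns / static_cast<double>(iterations) : 0.0};
}
```

Comparing run_uncontended(n, true) against run_uncontended(n, false) for a large n gives a rough idea of the extra cost per lock/unlock pair on a given machine. The numbers will vary with the CPU, the C library and the compiler flags.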
I'm sure it's an implementation detail. But I've just looked on my laptop and the mutex lock was
lock cmpxchg
so, it's a bus lock, indeed, not cheap.
It's not a bus lock on modern implementations; the prefix rather specifies that the operation is to be executed atomically. Quoting the Intel Software Developer's Manual [link <https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-260.html>]:
Locked operations are atomic with respect to all other memory operations and all externally visible events.
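As an illustration of where such a locked instruction comes from in C++ code, here is a minimal sketch of a try-lock built on a compare-exchange. TinyLock and tiny_lock_demo() are hypothetical names, and this is not how the pthread mutex is implemented. On x86-64, GCC and Clang typically compile compare_exchange_strong on an int to lock cmpxchg.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical minimal lock: 0 means free, 1 means taken.
class TinyLock {
  std::atomic<int> state{0};
public:
  // Atomically flips 0 -> 1; returns false if the lock was already taken.
  bool try_lock()
  {
    int expected = 0;
    return state.compare_exchange_strong(expected, 1,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed);
  }
  // Releasing needs only a store with release ordering.
  void unlock() { state.store(0, std::memory_order_release); }
};

// Returns true if the lock can be taken once, refuses a second taker,
// and can be taken again after unlock.
bool tiny_lock_demo()
{
  TinyLock lock;
  bool first = lock.try_lock();   // lock was free
  bool second = lock.try_lock();  // lock is already held
  lock.unlock();
  bool third = lock.try_lock();   // free again
  lock.unlock();
  return first && !second && third;
}
```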
Normally the locked operation will be synchronized through MESI, unless the operand crosses a cache-line boundary or the memory is marked as non-cacheable: https://stackoverflow.com/a/3339380

The price is still notable, though. Agner Fog <https://www.agner.org/optimize/instruction_tables.pdf> measures a lock cmpxchg latency of 20 CPU cycles on Haswell. I did not find a direct answer on whether that is the contended case or not; I guess it's all measured in a single thread, so it should be the uncontended case. Also see https://www.uops.info/html-instr/CMPXCHG_LOCK_M64_R64.html with similar results.

-- 
Yours truly,
Nikita Malyavin