Hi Kristian and Sergei!

I will walk through the code later. For now, I have a few comments on Sergei's last reply:

On Fri, 20 Oct 2023 at 22:35, Sergei Golubchik via developers <developers@lists.mariadb.org> wrote:
Also, it made me reconsider the old rule of "avoid current_thd, it's
expensive, pass THD as an argument". What to avoid, it's just

Dump of assembler code for function _current_thd():
   0x000055780130bb21 <+0>:     push   %rbp
   0x000055780130bb22 <+1>:     mov    %rsp,%rbp
   0x000055780130bb25 <+4>:     mov    %fs:0x0,%rax
   0x000055780130bb2e <+13>:    lea    -0x1c8(%rax),%rax
   0x000055780130bb35 <+20>:    mov    (%rax),%rax
   0x000055780130bb38 <+23>:    pop    %rbp
   0x000055780130bb39 <+24>:    ret

(in a debug build, I don't have an optimized build handy)

Yes, Eugene Kosov changed the implementation some years ago to use C++11's thread_local instead of pthread_getspecific.
So now it costs about two memory loads. If accessed from a shared object, it can take one more load on ELF, as far as I understand the ABI; see the attachment.
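
To make the shape of such an accessor concrete, here is a minimal sketch. It is not the server code: Session, current_session and the two helper functions are hypothetical names made up for illustration, standing in for THD and _current_thd().

```cpp
#include <cassert>

// Hypothetical stand-in for the server's per-connection THD object.
struct Session {
  int id;
};

// One pointer slot per thread. On x86-64 Linux the compiler addresses it
// relative to the thread pointer (%fs), so reading it needs no library call.
static thread_local Session *current_session = nullptr;

// Analogue of _current_thd(): returns the calling thread's own slot.
Session *get_current_session() { return current_session; }

// Installs a session for the calling thread and returns the previous one,
// so the caller can restore it afterwards.
Session *set_current_session(Session *s) {
  Session *previous = current_session;
  current_session = s;
  return previous;
}
```

Each thread that calls get_current_session() reads its own copy of the pointer, which is why no locking is involved.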

> > Monty asserted many times that taking an uncontended mutex is cheap.
> > Did you think it's expensive or did you benchmark that?
>
> I did, actually! (in the past). Taking a mutex needs to reserve the
> associated cache line, which is not free. "Expensive" is relative, of
> course.

I'm sure it's an implementation detail.
But I've just looked on my laptop and the mutex lock was

  lock cmpxchg

so, it's a bus lock, indeed, not cheap.


On modern implementations it is not a bus lock; the prefix rather specifies that the operation must execute atomically.
Quoting the Intel Software Developer's Manual [link]:
Locked operations are atomic with respect to all other memory operations and all externally visible events.

Normally it is synchronized through the cache-coherence protocol (MESI), unless the operand crosses a cache-line boundary or the memory is marked as non-cacheable.

https://stackoverflow.com/a/3339380
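
A rough sketch of how one could check that a lock word stays inside a single cache line, so that a locked operation on it can be handled by the cache-coherence protocol. The 64-byte line size is an assumption (typical for current x86-64 CPUs), and LockWord, fits_in_one_cache_line and try_acquire are hypothetical names, not anything from the server.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; 64 bytes is typical for current x86-64 CPUs.
constexpr std::size_t kCacheLine = 64;

// True if the object [p, p + size) lies entirely within one cache line.
bool fits_in_one_cache_line(const void *p, std::size_t size) {
  auto addr = reinterpret_cast<std::uintptr_t>(p);
  return addr / kCacheLine == (addr + size - 1) / kCacheLine;
}

// Hypothetical lock word, aligned to a full line so it cannot straddle two.
struct alignas(kCacheLine) LockWord {
  std::atomic<std::uint64_t> state{0};
};

// Single acquire attempt: succeeds only if the word was 0 and is now 1.
// On x86-64 this is typically compiled to a lock cmpxchg.
bool try_acquire(LockWord &w) {
  std::uint64_t expected = 0;
  return w.state.compare_exchange_strong(expected, 1,
                                         std::memory_order_acquire);
}
```

A naturally aligned 8-byte word never crosses a line boundary either; the alignas here additionally keeps unrelated data off the same line.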

The price is still notable, though: Agner Fog measures the latency of lock cmpxchg as 20 CPU cycles on Haswell. I did not find a direct answer on whether that is the contended case or not; I guess it is all measured in a single thread, so it should be the uncontended case.

Also see https://www.uops.info/html-instr/CMPXCHG_LOCK_M64_R64.html with similar results.
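
For anyone who wants a rough number on their own machine, something like the following single-threaded loop could serve as a starting point. It is only a sketch: ns_per_uncontended_lock is a hypothetical helper, it measures wall-clock time for a whole lock/unlock pair of std::mutex rather than the latency of one instruction, and the result will vary with the CPU, the libc and the compiler flags.

```cpp
#include <chrono>
#include <cstdint>
#include <mutex>

// Average wall-clock nanoseconds per lock/unlock pair on a mutex that no
// other thread ever touches (the uncontended case).
double ns_per_uncontended_lock(std::uint64_t iterations) {
  std::mutex m;
  // volatile keeps the compiler from removing the guarded update.
  volatile std::uint64_t guarded = 0;

  auto start = std::chrono::steady_clock::now();
  for (std::uint64_t i = 0; i < iterations; ++i) {
    std::lock_guard<std::mutex> hold(m);
    guarded = guarded + 1;
  }
  auto stop = std::chrono::steady_clock::now();

  std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / static_cast<double>(iterations);
}
```

Calling it with a few million iterations and dividing by the cycle time of the CPU gives a figure that can be compared, loosely, with the per-instruction latencies in the tables above.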


--
Yours truly,
Nikita Malyavin