I will walk through the code later. For now, I have a few comments on Sergei's last reply:
Also, it made me reconsider the old rule of "avoid current_thd, it's
expensive, pass THD as an argument". What to avoid, it's just
Dump of assembler code for function _current_thd():
0x000055780130bb21 <+0>: push %rbp
0x000055780130bb22 <+1>: mov %rsp,%rbp
0x000055780130bb25 <+4>: mov %fs:0x0,%rax
0x000055780130bb2e <+13>: lea -0x1c8(%rax),%rax
0x000055780130bb35 <+20>: mov (%rax),%rax
0x000055780130bb38 <+23>: pop %rbp
0x000055780130bb39 <+24>: ret
(in a debug build, I don't have an optimized build handy)
Yes, Eugene Kosov changed the implementation to use C++11's thread_local instead of pthread_getspecific some years ago.
So now it costs about two memory accesses. If it is accessed from a shared object, there can be one more indirection in ELF, as far as I understand the ABI; see the attachment.
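To make the difference concrete, here is a minimal sketch of the two approaches side by side. This is not the actual server code: THD is a stand-in struct, and all function and variable names below are made up for illustration. The thread_local variant is what typically compiles down to a read at a fixed offset from %fs on x86-64 Linux, while the pthread variant goes through a library call and a key lookup.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical stand-in for the server's THD class.
struct THD { int id; };

// C++11 variant: one pointer per thread, resolved by the compiler and linker.
static thread_local THD *tls_current_thd = nullptr;

THD *current_thd_tls() { return tls_current_thd; }
void set_current_thd_tls(THD *thd) { tls_current_thd = thd; }

// Older variant: the pointer lives in a pthread key and is fetched by a call.
static pthread_key_t thd_key;
static pthread_once_t thd_key_once = PTHREAD_ONCE_INIT;

static void make_thd_key() {
  int rc = pthread_key_create(&thd_key, nullptr);
  assert(rc == 0);
  (void) rc;  // silence the unused warning when NDEBUG is defined
}

THD *current_thd_pthread() {
  pthread_once(&thd_key_once, make_thd_key);
  return static_cast<THD *>(pthread_getspecific(thd_key));
}

void set_current_thd_pthread(THD *thd) {
  pthread_once(&thd_key_once, make_thd_key);
  pthread_setspecific(thd_key, thd);
}
```

Both accessors return whatever the calling thread stored last, or a null pointer if it stored nothing. Disassembling current_thd_tls() and current_thd_pthread() in an optimized build should show the difference in instruction count.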
> > Monty asserted many times that taking an uncontended mutex is cheap.
> > Did you think it's expensive or did you benchmark that?
>
> I did, actually! (in the past). Taking a mutex needs to reserve the
> associated cache line, which is not free. "Expensive" is relative, of
> course.
I'm sure it's an implementation detail.
But I've just looked on my laptop and the mutex lock was
lock cmpxchg
so, it's a bus lock, indeed, not cheap.
It's never a bus lock on modern implementations; the lock prefix rather specifies that the operation is to be executed atomically.
Quoting the Intel Software Developer's Manual [link]:
Locked operations are atomic with respect to all other memory operations and all externally visible events.
Normally it is synchronized through the cache coherence protocol (MESI), unless the operand spans two cache lines or the memory is marked as non-cacheable.
The price is still noticeable, though.
Agner Fog measures a lock cmpxchg latency of 20 CPU cycles on Haswell. I did not find a direct answer as to whether that is the contended case or not; I guess it's all measured in a single thread, so it should be the uncontended case.
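For anyone who wants to check the uncontended cost on their own machine, a rough single-threaded micro-benchmark could look like the sketch below. All names are made up for illustration. It assumes x86-64, where compare_exchange_strong on an 8-byte std::atomic typically compiles to lock cmpxchg. It reports wall-clock nanoseconds rather than cycles, and the numbers include loop overhead, so treat them as an upper bound, not as a replacement for Agner Fog's tables.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>

// Average wall-clock nanoseconds per call of op(), measured in one thread.
template <class Op>
double ns_per_op(std::uint64_t iterations, Op op) {
  assert(iterations > 0);
  const auto start = std::chrono::steady_clock::now();
  for (std::uint64_t i = 0; i < iterations; ++i) op();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::nano>(stop - start).count() /
         static_cast<double>(iterations);
}

// One compare-and-swap; it always succeeds when no other thread touches word.
bool cas_increment(std::atomic<std::uint64_t> &word) {
  std::uint64_t expected = word.load(std::memory_order_relaxed);
  return word.compare_exchange_strong(expected, expected + 1);
}

// Uncontended CAS on a word that sits alone on its own 64-byte cache line.
double ns_per_uncontended_cas(std::uint64_t iterations) {
  alignas(64) std::atomic<std::uint64_t> word{0};
  return ns_per_op(iterations, [&] { cas_increment(word); });
}

// Uncontended lock/unlock pair around a trivial critical section.
double ns_per_uncontended_mutex(std::uint64_t iterations) {
  std::mutex m;
  std::uint64_t guarded = 0;
  return ns_per_op(iterations, [&] {
    std::lock_guard<std::mutex> guard(m);
    ++guarded;
  });
}
```

Calling ns_per_uncontended_cas(10000000) and ns_per_uncontended_mutex(10000000) from main and printing the results gives a first impression. A contended measurement would need several threads hammering the same word, which this sketch deliberately leaves out.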