Laurynas Biveinis <laurynas.biveinis@gmail.com> writes:
> We have another platform-specific addition: thread-local storage
> implemented by __thread GCC keyword, which is GNU specific. This is

Hm, this is interesting, I was not aware of the __thread feature. This looks
to be much more efficient than pthread_getspecific(), which is one of the top
functions in the profiles for sysbench readonly:

    static __thread long tcount = 0;
    long bump(long del) { return (tcount += del); }

    0000000000000000 <bump>:
       0:   48 89 f8                mov    %rdi,%rax
       3:   64 48 03 04 25 00 00    add    %fs:0x0,%rax
       a:   00 00
       c:   64 48 89 04 25 00 00    mov    %rax,%fs:0x0
      13:   00 00
      15:   c3                      retq

It is just a single instruction using %fs: addressing to access the
thread-local storage. Apparently the linker resolves the %fs: offset in the
resulting binary:

    0000000000400580 <bump>:
      400580:   48 89 f8                mov    %rdi,%rax
      400583:   64 48 03 04 25 e0 ff    add    %fs:0xffffffffffffffe0,%rax
      40058a:   ff ff
      40058c:   64 48 89 04 25 e0 ff    mov    %rax,%fs:0xffffffffffffffe0
      400593:   ff ff
      400595:   c3                      retq

    00000000004005a0 <bump2>:
      4005a0:   48 89 f8                mov    %rdi,%rax
      4005a3:   64 48 03 04 25 f0 ff    add    %fs:0xfffffffffffffff0,%rax
      4005aa:   ff ff
      4005ac:   64 48 89 04 25 f0 ff    mov    %rax,%fs:0xfffffffffffffff0
      4005b3:   ff ff
      4005b5:   c3                      retq

So that could be very promising for an easy way to tweak out a little more
performance. Something to look into at some point.

 - Kristian.
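
P.S. For comparison, here is a rough sketch of the same counter written both ways: once with __thread and once with the portable pthread_getspecific() API. It is only an illustration. The names bump_tls, bump_key, count_key and make_key are made up for this example and do not come from the server source, and the sketch assumes GCC or Clang on a POSIX system, built with -pthread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Variant 1: GNU __thread keyword. Each thread gets its own tcount,
 * and the compiler addresses it directly (via %fs: on x86-64 Linux). */
static __thread long tcount = 0;

long bump_tls(long del)
{
    return (tcount += del);
}

/* Variant 2: POSIX thread-specific data. Every access goes through
 * a library call, pthread_getspecific(), to find the per-thread slot. */
static pthread_key_t count_key;                 /* hypothetical name */
static pthread_once_t count_once = PTHREAD_ONCE_INIT;

static void make_key(void)
{
    /* free() is registered as the destructor, so each thread's
     * counter is released when that thread exits. */
    int rc = pthread_key_create(&count_key, free);
    assert(rc == 0);
    (void)rc;
}

long bump_key(long del)
{
    long *p;

    pthread_once(&count_once, make_key);        /* create the key once */
    p = pthread_getspecific(count_key);
    if (p == NULL) {
        /* First call on this thread: allocate a zeroed counter. */
        p = calloc(1, sizeof *p);
        if (p == NULL)
            abort();
        if (pthread_setspecific(count_key, p) != 0)
            abort();
    }
    return (*p += del);
}
```

Both functions return the running per-thread total, so bump_tls(5) followed by bump_tls(-2) yields 5 and then 3 on the calling thread, and likewise for bump_key(). The second variant needs a function call, a NULL check and a lazy allocation on the access path, which is presumably where the profile cost comes from.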