Re: [Maria-discuss] Some questions about the Aarch64 CI

26 Nov 2019

On Tue, 26 Nov 2019 10:56:41 +0200
Marko Mäkelä <marko.makela@mariadb.com> wrote:
...
Hi Daniel,
On Tue, Nov 26, 2019 at 2:02 AM Daniel Black <daniel@linux.ibm.com> wrote:
...
On Mon, 25 Nov 2019 11:32:07 +0200
Marko Mäkelä <marko.makela@mariadb.com> wrote:
...
I also found a claim that POWER8 supports unaligned access,
This is correct for normal cacheable memory. The exception is device I/O mapped memory, which is not applicable to MariaDB.
...
and I seem
to remember that the latest version of the SPARC introduced support
for that as well. (IA-32 and AMD64 have always supported unaligned
access, except for some SIMD operations.)
Last, I believe that we could get some performance benefits if
include/byte_order_generic.h was rewritten in a suitable way. Ideally,
include/byte_order_generic_x86_64.h would be replaced with a portable
version of both, and compilers could simply perform the optimizations.
I have been told that replacing the + in the macros with | could
already be a good start. I would welcome patches in this area.
I've never managed to find the time to look at these; however, a non-aligned version for non-common arches seems a better way to model this.
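
To make the idea concrete, here is a minimal sketch of what a portable 4-byte read could look like. It is not the code from include/byte_order_generic.h; the helper names load_le32_bytes() and load_host32_memcpy() are hypothetical and exist only for this illustration. The first variant combines the bytes with | rather than +, the second relies on a fixed-size memcpy() that the compiler is free to turn into a single (possibly unaligned) load.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers for illustration only; these are not the
   macros from include/byte_order_generic.h. */

/* Byte-wise little-endian load, combining the bytes with '|'.
   Valid for any pointer alignment and any host byte order. */
static inline uint32_t load_le32_bytes(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Fixed-size memcpy() load. The value is in host byte order, so it
   equals load_le32_bytes() only on a little-endian host. */
static inline uint32_t load_host32_memcpy(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Whether a given compiler actually merges the byte-wise form into one load depends on the target and the optimization level, so the generated code would need to be checked per platform.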
I pushed my micro-optimization to 10.5:
https://github.com/MariaDB/server/commit/25e2a556de2e125784d52a0c7ccda4fa659...
If there really is no compiler flag that would allow any memcpy(),
memset(),
memcmp()
Well, actually:

-fno-builtin-mem{cmp,set,cpy}

-mmem{set,cpy}-strategy= (seems x86 only)
...
of 2,4,8 bytes to be translated into simple
(possibly unaligned) multi-byte instructions,
Actually, gcc has already put effort into getting the optimum implementation here. It doesn't look like something an end application should be trying to optimise.

$ rm -f  memset_opt.o &&  gcc -O1 -fomit-frame-pointer -c   memset_opt.c -o  memset_opt.o && objdump -d memset_opt.o | grep -A 10 vmem
0000000000000000 <vmemset>:
   0:	c7 07 00 00 00 00    	movl   $0x0,(%rdi)
   6:	c3                   	retq   

0000000000000007 <vmemcmp>:
   7:	48 83 ec 18          	sub    $0x18,%rsp
   b:	89 7c 24 0c          	mov    %edi,0xc(%rsp)
   f:	ba 04 00 00 00       	mov    $0x4,%edx
  14:	48 8d 74 24 0c       	lea    0xc(%rsp),%rsi
  19:	bf 00 00 00 00       	mov    $0x0,%edi
  1e:	e8 00 00 00 00       	callq  23 <vmemcmp+0x1c>
  23:	48 83 c4 18          	add    $0x18,%rsp
  27:	c3                   	retq   

0000000000000028 <vmemstatic>:
  28:	b8 ff ff ff ff       	mov    $0xffffffff,%eax
  2d:	c3                   	retq   

000000000000002e <vmemcpy>:
  2e:	8b 05 00 00 00 00    	mov    0x0(%rip),%eax        # 34 <vmemcpy+0x6>
  34:	89 07                	mov    %eax,(%rdi)
  36:	c3                   	retq   

[dan@volution junk]$ cat memset_opt.c

#include <string.h>

static int comp = 7;

char r[30];

void vmemset(char v[30])
{
    memset(v, 0, 4);
}

int vmemcmp(int c)
{
   return memcmp(&comp, &c, sizeof(c));
}

int vmemstatic()
{
   return memcmp("cat", "dog", 3);
}

void vmemcpy(int *c)
{
   memcpy(c, r, sizeof(*c));
}

Not sure why vmemcmp still has a memcmp call, but vmemstatic shows that some understanding is there.
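
One possible explanation for the remaining library call in vmemcmp, offered as an unverified assumption: the function returns the full three-way result, which is harder to expand inline than a plain equality test. A variant worth trying in the same experiment is sketched below; vmemeq() is a hypothetical name, not part of memset_opt.c. Whether the call disappears is likely to depend on the gcc version and optimization level, so the objdump output would have to confirm it.

```c
#include <string.h>

static int comp = 7;

/* Hypothetical variant of vmemcmp(): only equality is needed, so the
   result of memcmp() is compared against zero instead of returned. */
int vmemeq(int c)
{
    return memcmp(&comp, &c, sizeof(c)) == 0;
}
```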

script to test:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052#c12

Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz (laptop)
glibc-2.29-22.fc30
gcc (GCC) 9.2.1 20190827 (Red Hat 9.2.1-1)

$ sh test_stringop 64 640000000 gcc -march=native
memcpy mode:64 size:640000000
                   libcall   rep1   noalg    rep4   noalg    rep8   noalg    loop   noalg    unrl   noalg    byte profiled dynamic
block size 8192000 0:00.12 0:00.12 0:00.12 0:00.13 0:00.12 0:00.12 0:00.13 0:00.13 0:00.13 0:00.13 0:00.13 0:00.50 0:00.12 0:00.12 best: 0:00.12 libcall
block size  819200 0:00.08 0:00.10 0:00.10 0:00.10 0:00.10 0:00.10 0:00.10 0:00.09 0:00.09 0:00.09 0:00.09 0:00.48 0:00.08 0:00.08 best: 0:00.08 libcall
block size   81920 0:00.04 0:00.05 0:00.05 0:00.05 0:00.04 0:00.04 0:00.05 0:00.09 0:00.08 0:00.07 0:00.08 0:00.51 0:00.04 0:00.04 best: 0:00.04 libcall
block size   20480 0:00.04 0:00.04 0:00.04 0:00.04 0:00.04 0:00.04 0:00.04 0:00.07 0:00.11 0:00.08 0:00.08 0:00.86 0:00.03 0:00.04 best: 0:00.04 libcall
block size    8192 0:00.03 0:00.03 0:00.04 0:00.04 0:00.03 0:00.03 0:00.03 0:00.06 0:00.10 0:00.06 0:00.07 0:00.48 0:00.03 0:00.03 best: 0:00.03 libcall
block size    4096 0:00.03 0:00.03 0:00.03 0:00.03 0:00.03 0:00.03 0:00.03 0:00.06 0:00.10 0:00.06 0:00.07 0:00.47 0:00.03 0:00.03 best: 0:00.03 libcall
block size    2048 0:00.03 0:00.03 0:00.03 0:00.04 0:00.03 0:00.04 0:00.03 0:00.07 0:00.10 0:00.06 0:00.07 0:00.48 0:00.03 0:00.03 best: 0:00.03 libcall
block size    1024 0:00.04 0:00.04 0:00.04 0:00.05 0:00.04 0:00.05 0:00.04 0:00.08 0:00.11 0:00.07 0:00.07 0:00.49 0:00.03 0:00.04 best: 0:00.04 libcall
block size     512 0:00.05 0:00.06 0:00.06 0:00.06 0:00.05 0:00.06 0:00.05 0:00.09 0:00.12 0:00.07 0:00.07 0:00.50 0:00.09 0:00.06 best: 0:00.05 libcall
block size     256 0:00.07 0:00.08 0:00.08 0:00.09 0:00.08 0:00.09 0:00.08 0:00.10 0:00.12 0:00.09 0:00.09 0:00.52 0:00.10 0:00.09 best: 0:00.07 libcall
block size     128 0:00.11 0:00.13 0:00.13 0:00.15 0:00.13 0:00.15 0:00.13 0:00.14 0:00.14 0:00.12 0:00.11 0:00.56 0:00.12 0:00.12 best: 0:00.11 libcall
block size      64 0:00.20 0:00.20 0:00.20 0:00.24 0:00.22 0:00.24 0:00.22 0:00.19 0:00.20 0:00.18 0:00.19 0:00.75 0:00.18 0:00.18 best: 0:00.18 unrl
block size      48 0:00.25 0:00.28 0:00.28 0:00.31 0:00.31 0:00.31 0:00.29 0:00.23 0:00.22 0:00.22 0:00.23 0:00.66 0:00.22 0:00.22 best: 0:00.22 loopnoalign
block size      32 0:00.38 0:00.40 0:00.38 0:00.44 0:00.39 0:00.45 0:00.38 0:00.30 0:00.32 0:00.30 0:00.31 0:00.93 0:00.31 0:00.31 best: 0:00.30 loop
block size      24 0:00.51 0:00.57 0:00.56 0:00.63 0:00.58 0:00.64 0:00.56 0:00.42 0:00.40 0:00.36 0:00.37 0:00.78 0:00.37 0:00.36 best: 0:00.36 unrl
block size      16 0:00.75 0:00.74 0:00.74 0:00.84 0:00.74 0:00.85 0:00.70 0:00.48 0:00.47 0:00.40 0:00.40 0:00.87 0:00.48 0:00.47 best: 0:00.40 unrl
block size      14 0:00.76 0:00.95 0:00.97 0:01.01 0:00.99 0:01.01 0:00.89 0:00.52 0:00.51 0:00.49 0:00.49 0:00.88 0:00.54 0:00.53 best: 0:00.49 unrl
block size      12 0:00.93 0:01.10 0:01.10 0:01.14 0:01.05 0:01.19 0:00.98 0:00.64 0:00.61 0:00.56 0:00.57 0:00.79 0:00.64 0:00.59 best: 0:00.56 unrl
block size      10 0:01.04 0:01.31 0:01.31 0:01.37 0:01.23 0:01.41 0:01.14 0:00.75 0:00.74 0:00.68 0:00.65 0:00.86 0:00.68 0:00.69 best: 0:00.65 unrlnoalign
block size       8 0:01.36 0:01.59 0:01.55 0:01.68 0:01.37 0:01.64 0:01.18 0:00.79 0:00.79 0:00.73 0:00.73 0:00.89 0:00.81 0:00.78 best: 0:00.73 unrl
block size       6 0:01.66 0:02.25 0:02.23 0:02.31 0:02.01 0:02.31 0:01.57 0:01.01 0:00.96 0:00.99 0:01.01 0:01.02 0:01.00 0:01.01 best: 0:00.96 loopnoalign
block size       4 0:02.68 0:03.24 0:03.47 0:03.21 0:02.65 0:01.38 0:01.36 0:01.41 0:01.34 0:01.38 0:01.35 0:01.47 0:01.26 0:01.32 best: 0:01.34 loopnoalign
block size       1 0:05.41 0:17.59 0:17.41 0:01.52 0:01.51 0:01.46 0:01.50 0:01.59 0:01.49 0:01.56 0:01.52 0:02.43 0:02.39 0:02.42 best: 0:01.46 rep8
memset
                   libcall   rep1   noalg    rep4   noalg    rep8   noalg    loop   noalg    unrl   noalg    byte profiled dynamic
block size 8192000 0:00.05 0:00.05 0:00.06 0:00.05 0:00.05 0:00.05 0:00.05 0:00.11 0:00.09 0:00.09 0:00.11 0:00.47 0:00.05 0:00.05 best: 0:00.05 libcall
block size  819200 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.05 0:00.10 0:00.08 0:00.07 0:00.08 0:00.45 0:00.06 0:00.05 best: 0:00.05 libcall
block size   81920 0:00.03 0:00.03 0:00.03 0:00.04 0:00.04 0:00.03 0:00.03 0:00.10 0:00.07 0:00.06 0:00.07 0:00.47 0:00.04 0:00.03 best: 0:00.03 libcall
block size   20480 0:00.03 0:00.03 0:00.03 0:00.03 0:00.03 0:00.03 0:00.03 0:00.08 0:00.11 0:00.07 0:00.08 0:00.51 0:00.03 0:00.03 best: 0:00.03 libcall
block size    8192 0:00.03 0:00.03 0:00.03 0:00.03 0:00.03 0:00.03 0:00.03 0:00.09 0:00.06 0:00.06 0:00.06 0:00.45 0:00.03 0:00.03 best: 0:00.03 libcall
block size    4096 0:00.03 0:00.03 0:00.03 0:00.03 0:00.03 0:00.03 0:00.04 0:00.10 0:00.06 0:00.06 0:00.06 0:00.41 0:00.03 0:00.03 best: 0:00.03 libcall
block size    2048 0:00.04 0:00.03 0:00.04 0:00.04 0:00.03 0:00.04 0:00.03 0:00.10 0:00.06 0:00.06 0:00.06 0:00.41 0:00.04 0:00.04 best: 0:00.03 rep1
block size    1024 0:00.05 0:00.04 0:00.05 0:00.05 0:00.04 0:00.05 0:00.04 0:00.10 0:00.07 0:00.07 0:00.07 0:00.41 0:00.05 0:00.05 best: 0:00.04 rep1
block size     512 0:00.07 0:00.06 0:00.06 0:00.07 0:00.06 0:00.07 0:00.06 0:00.11 0:00.07 0:00.08 0:00.07 0:00.42 0:00.07 0:00.07 best: 0:00.06 rep1
block size     256 0:00.11 0:00.08 0:00.08 0:00.10 0:00.08 0:00.10 0:00.08 0:00.12 0:00.09 0:00.09 0:00.09 0:00.44 0:00.10 0:00.11 best: 0:00.08 rep1
block size     128 0:00.15 0:00.13 0:00.13 0:00.15 0:00.13 0:00.15 0:00.12 0:00.15 0:00.12 0:00.13 0:00.13 0:00.50 0:00.14 0:00.14 best: 0:00.12 loopnoalign
block size      64 0:00.28 0:00.21 0:00.22 0:00.25 0:00.23 0:00.23 0:00.22 0:00.20 0:00.20 0:00.20 0:00.21 0:00.50 0:00.20 0:00.20 best: 0:00.20 loop
block size      48 0:00.31 0:00.27 0:00.27 0:00.30 0:00.30 0:00.29 0:00.28 0:00.24 0:00.23 0:00.24 0:00.25 0:00.59 0:00.24 0:00.24 best: 0:00.23 loopnoalign
block size      32 0:00.47 0:00.36 0:00.36 0:00.40 0:00.37 0:00.40 0:00.37 0:00.30 0:00.31 0:00.31 0:00.30 0:00.58 0:00.31 0:00.31 best: 0:00.30 loop
block size      24 0:00.62 0:00.55 0:00.55 0:00.59 0:00.56 0:00.55 0:00.52 0:00.35 0:00.35 0:00.35 0:00.36 0:00.66 0:00.35 0:00.34 best: 0:00.35 loop
block size      16 0:00.92 0:00.78 0:00.72 0:00.76 0:00.70 0:00.71 0:00.63 0:00.40 0:00.40 0:00.33 0:00.34 0:00.67 0:00.39 0:00.40 best: 0:00.33 unrl
block size      14 0:00.98 0:00.94 0:00.95 0:00.95 0:00.91 0:00.90 0:00.85 0:00.43 0:00.43 0:00.39 0:00.39 0:00.68 0:00.43 0:00.43 best: 0:00.39 unrl
block size      12 0:01.16 0:01.11 0:01.10 0:01.09 0:01.03 0:01.01 0:00.87 0:00.43 0:00.46 0:00.44 0:00.43 0:00.72 0:00.46 0:00.45 best: 0:00.43 loop
block size      10 0:01.39 0:01.33 0:01.33 0:01.29 0:01.21 0:01.17 0:00.99 0:00.49 0:00.51 0:00.50 0:00.55 0:00.84 0:00.58 0:00.57 best: 0:00.49 loop
block size       8 0:01.87 0:01.51 0:01.47 0:01.43 0:01.26 0:01.27 0:00.96 0:00.57 0:00.56 0:00.52 0:00.51 0:00.83 0:00.56 0:00.55 best: 0:00.51 unrlnoalign
block size       6 0:02.17 0:02.26 0:02.29 0:01.99 0:01.80 0:01.56 0:01.27 0:00.70 0:00.70 0:00.74 0:00.72 0:00.92 0:00.74 0:00.71 best: 0:00.70 loop
block size       4 0:03.16 0:03.16 0:03.11 0:02.47 0:02.04 0:01.02 0:00.95 0:00.92 0:00.93 0:00.91 0:00.93 0:01.09 0:00.93 0:01.08 best: 0:00.91 unrl
block size       1 0:04.64 0:17.11 0:18.85 0:01.78 0:01.79 0:01.77 0:01.76 0:01.74 0:01.79 0:01.70 0:01.68 0:02.05 0:01.27 0:02.27 best: 0:01.68 unrlnoalign

For non-x86 I modified the above script (attached) to run the memX functions and compare them with the {-fno-builtin-X} builds.

root@ozrom2:~# sh test_stringop 64 640000000 gcc -mcpu=power9  | tee out.txt
root@ozrom2:~# gcc --version
gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0

memcpy mode:64 size:640000000
                   libcall   nobuiltin   byte profiled 
block size 8192000 0:00.04 0:00.04 0:00.04 best: 0:00.04 libcall
block size  819200 0:00.04 0:00.03 0:00.04 best: 0:00.03 nobuiltin
block size   81920 0:00.03 0:00.03 0:00.03 best: 0:00.03 libcall
block size   20480 0:00.03 0:00.03 0:00.04 best: 0:00.03 libcall
block size    8192 0:00.03 0:00.04 0:00.04 best: 0:00.03 libcall
block size    4096 0:00.04 0:00.04 0:00.04 best: 0:00.04 libcall
block size    2048 0:00.05 0:00.05 0:00.05 best: 0:00.05 libcall
block size    1024 0:00.07 0:00.07 0:00.06 best: 0:00.07 libcall
block size     512 0:00.09 0:00.09 0:00.10 best: 0:00.09 libcall
block size     256 0:00.12 0:00.12 0:00.11 best: 0:00.12 libcall
block size     128 0:00.19 0:00.20 0:00.19 best: 0:00.19 libcall
block size      64 0:00.32 0:00.32 0:00.31 best: 0:00.32 libcall
block size      48 0:00.45 0:00.44 0:00.46 best: 0:00.44 nobuiltin
block size      32 0:00.58 0:00.59 0:00.58 best: 0:00.58 libcall
block size      24 0:00.82 0:00.83 0:00.81 best: 0:00.82 libcall
block size      16 0:01.09 0:01.10 0:01.05 best: 0:01.09 libcall
block size      14 0:01.36 0:01.36 0:01.32 best: 0:01.36 libcall
block size      12 0:01.58 0:01.57 0:01.56 best: 0:01.57 nobuiltin
block size      10 0:01.88 0:01.88 0:01.84 best: 0:01.88 libcall
block size       8 0:02.14 0:02.14 0:02.03 best: 0:02.14 libcall
block size       6 0:03.24 0:03.24 0:03.01 best: 0:03.24 libcall
block size       4 0:04.27 0:04.26 0:03.90 best: 0:04.26 nobuiltin
block size       1 0:18.42 0:18.45 0:15.29 best: 0:18.42 libcall
memset
                   libcall   nobuiltin   byte profiled 
block size 8192000 0:00.04 0:00.04 0:00.04 best: 0:00.04 libcall
block size  819200 0:00.04 0:00.03 0:00.04 best: 0:00.03 nobuiltin
block size   81920 0:00.04 0:00.03 0:00.03 best: 0:00.03 nobuiltin
block size   20480 0:00.04 0:00.04 0:00.03 best: 0:00.04 libcall
block size    8192 0:00.03 0:00.04 0:00.03 best: 0:00.03 libcall
block size    4096 0:00.04 0:00.04 0:00.04 best: 0:00.04 libcall
block size    2048 0:00.05 0:00.05 0:00.05 best: 0:00.05 libcall
block size    1024 0:00.07 0:00.07 0:00.06 best: 0:00.07 libcall
block size     512 0:00.09 0:00.09 0:00.10 best: 0:00.09 libcall
block size     256 0:00.13 0:00.12 0:00.11 best: 0:00.12 nobuiltin
block size     128 0:00.19 0:00.18 0:00.19 best: 0:00.18 nobuiltin
block size      64 0:00.31 0:00.32 0:00.31 best: 0:00.31 libcall
block size      48 0:00.44 0:00.45 0:00.45 best: 0:00.44 libcall
block size      32 0:00.58 0:00.58 0:00.58 best: 0:00.58 libcall
block size      24 0:00.82 0:00.82 0:00.81 best: 0:00.82 libcall
block size      16 0:01.09 0:01.09 0:01.05 best: 0:01.09 libcall
block size      14 0:01.36 0:01.36 0:01.32 best: 0:01.36 libcall
block size      12 0:01.57 0:01.58 0:01.55 best: 0:01.57 libcall
block size      10 0:01.90 0:01.90 0:01.83 best: 0:01.90 libcall
block size       8 0:02.14 0:02.15 0:02.05 best: 0:02.14 libcall
block size       6 0:03.20 0:03.20 0:03.03 best: 0:03.20 libcall
block size       4 0:04.26 0:04.27 0:03.91 best: 0:04.26 libcall
block size       1 0:18.43 0:18.44 0:15.30 best: 0:18.43 libcall

So it's pretty much better or identical to use the default (builtin) memcpy/memset in all cases; the ones showing up as nobuiltin are pretty much in the noise of measurement.
...
then we might add
further MY_ASSUME_ALIGNED() assertions here and there, to allow gcc
and clang to generate better code for POWER and ARM.
If the compiler is smart enough, it might suffice to implement an
accessor for buf_block_t or buf_block_t::frame that would
MY_ASSUME_ALIGNED(frame, 4096). Then the compiler might correctly
infer the alignment of (block->frame + some_compile_time_constant) and
enable the optimization. I would be unwilling to pepper such hints all
over the code.
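
A minimal sketch of such an accessor follows, under stated assumptions: struct block_sketch is a hypothetical, heavily simplified stand-in for buf_block_t, ASSUME_ALIGNED_SKETCH() stands in for MY_ASSUME_ALIGNED(), and the compiler is assumed to be GCC- or clang-compatible so that __builtin_assume_aligned() is available. The offset 40 is arbitrary.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for MY_ASSUME_ALIGNED(); falls back to a no-op on
   compilers without __builtin_assume_aligned(). */
#if defined(__GNUC__)
# define ASSUME_ALIGNED_SKETCH(p, a) __builtin_assume_aligned((p), (a))
#else
# define ASSUME_ALIGNED_SKETCH(p, a) (p)
#endif

/* Hypothetical simplified block descriptor; not the real buf_block_t. */
struct block_sketch {
    unsigned char *frame;  /* must point to 4096-byte aligned memory */
};

/* The single place that carries the alignment promise. */
static inline unsigned char *block_frame(const struct block_sketch *b)
{
    return ASSUME_ALIGNED_SKETCH(b->frame, 4096);
}

/* Read 4 bytes at a compile-time constant offset from the page start.
   With the hint above, the compiler may infer that this address is
   4-byte aligned. */
static inline uint32_t block_read32_at_40(const struct block_sketch *b)
{
    uint32_t v;
    memcpy(&v, block_frame(b) + 40, sizeof v);
    return v;
}
```

The hint is a promise rather than a check: if frame is ever not 4096-byte aligned, the behaviour is undefined, which is one more reason to keep it in a single accessor.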
Marko