[Maria-discuss] Some questions about the Aarch64 CI
Hi MariaDB,

Thanks for the great unconference in Shanghai this week; it was a really useful event for a MariaDB newbie like me.

We mentioned that we are willing to donate some ARM resources to the foundation for ARM testing and other purposes in the future. I donated one machine for a POC on 20th Nov, and I can now see a new builder (worker) on https://buildbot.mariadb.org/#/builders named ``aarch-fedora-30``, with version 2019.11.20, so I guess this is running on the server that I donated? It seems to have run successfully for a few rounds.

So I have some questions about the Aarch64 CI:

1. The jobs are run through Docker, so jobs for different OSes could be run on this host, right? Is it possible to also enable a CentOS7 job now?
2. As I mentioned at the unconference, we will have an open source OS released soon, and we are looking for possibilities to have it tested upstream as well. If the answer to the first question is yes, we will have to prepare a base Docker image for our OS, right?
3. I saw that there is a sponsors page for buildbot: https://buildbot.mariadb.org/#/sponsor . Could we be listed on that page too?

BR,
Kevin Zheng
Hi Kevin,

First of all thank you very much for your support and graceful donation! You assumed correctly and the ``aarch-fedora-30`` builder is running on the new machine that you donated. However, I renamed the builder at this point, so the new one is ``aarch64-fedora-30``. Also, I have added a new CentOS7 builder, namely ``aarch64-centos-7``. You can see both of them on the builders page https://buildbot.mariadb.org/#/builders.

Now, to answer your questions:

1. So the jobs are running through docker, so jobs for different OS could be run on this host, right? Is it possible to also enable a CentOS7 job now?

Yes, you are right. We use docker to run the jobs and we can add more builders with different OSs. The new CentOS7 builder is up and running: https://buildbot.mariadb.org/#/builders/33.

2. As I mentioned in the unconference, we will have an OpenSource OS released soon, and we are looking for possibilities to also make it tested in the upstream, If the answer to the first question is True, we will have to prepare a base docker image for our OS, right?

The easiest setup would be to have a docker image of the OS. However, if that is not possible, we can discuss and come up with different other potential solutions. So, let us know when the OS is released.

3. I saw that there is a sponsors site for buildbot: https://buildbot.mariadb.org/#/sponsor , are we able to be on that page too?

I have updated the sponsors page. However, if you have any suggestions or other requests regarding the sponsor page, let us know so that we can update it accordingly.

Cheers,
Vlad
Hi Vlad,

Thanks a lot for the quick response, the results look cool.

Also, I noticed that the ARM job is sometimes slower than the others in the fetch_tarball phase. This might be because our machine is in China and the network connection is a little slow. I just got word that our machines will be available in Singapore later this month or early next month, so maybe then we can provide a machine with a faster network, which can speed up the jobs.

BR,
________________________________
From: Vlad Bogolin <vlad@mariadb.org>
Sent: 2019-11-22 17:00
To: Zheng Zhenyu <zheng.zhenyu@outlook.com>
Cc: maria-discuss@lists.launchpad.net <maria-discuss@lists.launchpad.net>
Subject: Re: [Maria-discuss] Some questions about the Aarch64 CI

Hi Kevin,

First of all thank you very much for your support and graceful donation! You assumed correctly and the ``aarch-fedora-30`` builder is running on the new machine that you donated. However, I renamed the builder at this point, so the new one is ``aarch64-fedora-30``. Also, I have added a new CentOS7 builder, namely ``aarch64-centos-7``. You can see both of them on the builders page https://buildbot.mariadb.org/#/builders.

Now, to answer your questions:

1. So the jobs are running through docker, so jobs for different OS could be run on this host, right? Is it possible to also enable a CentOS7 job now?

Yes, you are right. We use docker to run the jobs and we can add more builders with different OSs. The new CentOS7 builder is up and running: https://buildbot.mariadb.org/#/builders/33.

2. As I mentioned in the unconference, we will have an OpenSource OS released soon, and we are looking for possibilities to also make it tested in the upstream, If the answer to the first question is True, we will have to prepare a base docker image for our OS, right?

The easiest setup would be to have a docker image of the OS. However, if that is not possible, we can discuss and come up with different other potential solutions. So, let us know when the OS is released.

3. I saw that there is a sponsors site for buildbot: https://buildbot.mariadb.org/#/sponsor , are we able to be on that page too?

I have updated the sponsors page. However, if you have any suggestions or other requests regarding the sponsor page, let us know so that we can update it accordingly.

Cheers,
Vlad
Hi Kevin, On Mon, Nov 25, 2019 at 8:23 AM Zheng Zhenyu <zheng.zhenyu@outlook.com> wrote:
Thanks alot for re quick response, the results looks cool. And Also, I noticed that the ARM job sometimes are slower than others in the fetch_tarball phase, this might due to that our machine is in China and the network connection is a little bit slow, I just got the info that our machine will be available in Singapore latter this month or ealier next month, maybe then we can then provide a machine with faster network which can speed up the jobs.
Related to this, I wonder if it would be possible to install a newer operating system (or Docker image), such as CentOS 8 or Debian 10 or the most recent Fedora.

What prompts me to ask is that I just noticed a compilation failure of MariaDB 10.2 that might be addressed by upgrading to a newer compiler:

/buildbot/aarch64-centos-7/build/storage/innobase/row/row0log.cc: In function 'dberr_t _ZL17row_log_apply_opsPK5trx_tP12dict_index_tP15row_merge_dup_tP16ut_stage_alter_t.isra.94(const trx_t*, dict_index_t*, row_merge_dup_t*)':
/buildbot/aarch64-centos-7/build/storage/innobase/row/row0log.cc:3734:1: error: could not split insn

More information is available in https://buildbot.mariadb.org/#/builders/33/builds/260 (including a request to submit a compiler bug report).

I remember seeing that kind of an error for some 64-bit atomic operation on a very old GCC targeting x86 (on CentOS 5 maybe?). While we have older compilers than GCC 4.8.5 for other instruction set architectures, I do not think that we run into internal compiler errors very often.

On my AMD64 desktop, I am currently using GCC 9.2.1 and clang 9.0.0. As a developer, I prefer to have the most recent versions of tools whenever it is possible, for better diagnostics and possibly better optimizations.

Best regards,
Marko
--
Marko Mäkelä, Lead Developer InnoDB
MariaDB Corporation
I thought EL7 is very much supported with aarch64. If that is the case won't what you are suggesting effectively abandon EL7? On Tue, Dec 10, 2019 at 9:25 PM Marko Mäkelä <marko.makela@mariadb.com> wrote:
Hi Kevin,
On Mon, Nov 25, 2019 at 8:23 AM Zheng Zhenyu <zheng.zhenyu@outlook.com> wrote:
Thanks alot for re quick response, the results looks cool. And Also, I noticed that the ARM job sometimes are slower than others in the fetch_tarball phase, this might due to that our machine is in China and the network connection is a little bit slow, I just got the info that our machine will be available in Singapore latter this month or ealier next month, maybe then we can then provide a machine with faster network which can speed up the jobs.
Related to this, I wonder if it would be possible to install a newer operating system (or Docker image), such as CentOS 8 or Debian 10 or the most recent Fedora.
What prompts me to ask is that I just noticed a compilation failure of MariaDB 10.2 that might be addressed by upgrading to a newer compiler:
/buildbot/aarch64-centos-7/build/storage/innobase/row/row0log.cc: In function 'dberr_t
_ZL17row_log_apply_opsPK5trx_tP12dict_index_tP15row_merge_dup_tP16ut_stage_alter_t.isra.94(const trx_t*, dict_index_t*, row_merge_dup_t*)': /buildbot/aarch64-centos-7/build/storage/innobase/row/row0log.cc:3734:1: error: could not split insn
More information is available in https://buildbot.mariadb.org/#/builders/33/builds/260 (including a request to submit a compiler bug report).
I remember seeing that kind of an error for some 64-bit atomic operation on a very old GCC targeting x86 (on CentOS 5 maybe?). While we have older compilers than GCC 4.8.5 for other instruction set architectures, I do not think that we run into internal compiler errors very often.
On my AMD64 desktop, I am currently using GCC 9.2.1 and clang 9.0.0. As a developer, I prefer to have the most recent versions of tools whenever it is possible, for better diagnostics and possibly better optimizations.
Best regards,
Marko -- Marko Mäkelä, Lead Developer InnoDB MariaDB Corporation
_______________________________________________ Mailing list: https://launchpad.net/~maria-discuss Post to : maria-discuss@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-discuss More help : https://help.launchpad.net/ListHelp
On Tue, 10 Dec 2019 21:54:37 +0000 Gordan Bobic <gordan.bobic@gmail.com> wrote:
I thought EL7 is very much supported with aarch64. If that is the case won't what you are suggesting effectively abandon EL7?
I suggest that, to help Red Hat get the ARM support, you:

a) search https://bugzilla.redhat.com/ for anything similar
b) reproduce on a RHEL EL7 for arm. Use:

    make VERBOSE=1

to extract the exact command line (including the change directory), add -save-temps to the C++ flags, and include the generated preprocessed file (.ii for a C++ source) in the bug report/support ticket.
On Tue, Dec 10, 2019 at 9:25 PM Marko Mäkelä <marko.makela@mariadb.com> wrote:
What prompts me to ask is that I just noticed a compilation failure of MariaDB 10.2 that might be addressed by upgrading to a newer compiler:
/buildbot/aarch64-centos-7/build/storage/innobase/row/row0log.cc: In function 'dberr_t
_ZL17row_log_apply_opsPK5trx_tP12dict_index_tP15row_merge_dup_tP16ut_stage_alter_t.isra.94(const trx_t*, dict_index_t*, row_merge_dup_t*)': /buildbot/aarch64-centos-7/build/storage/innobase/row/row0log.cc:3734:1: error: could not split insn
More information is available in https://buildbot.mariadb.org/#/builders/33/builds/260 (including a request to submit a compiler bug report).
Hi Marko,

First off, sorry for the delayed reply to your previous question; I was traveling, and I saw that a few people had already replied, so I did not.

As for your suggestion, yes, that will be no problem. I think Vlad takes care of the Docker images. The resources are donated to the community, and the community can do whatever is required.

Thanks,
Kevin Zheng
________________________________
From: Marko Mäkelä <marko.makela@mariadb.com>
Sent: 2019-12-10 21:25
To: Zheng Zhenyu <zheng.zhenyu@outlook.com>
Cc: Vlad Bogolin <vlad@mariadb.org>; maria-discuss@lists.launchpad.net <maria-discuss@lists.launchpad.net>
Subject: Re: [Maria-discuss] Reply: Some questions about the Aarch64 CI

Hi Kevin,

On Mon, Nov 25, 2019 at 8:23 AM Zheng Zhenyu <zheng.zhenyu@outlook.com> wrote:
Thanks alot for re quick response, the results looks cool. And Also, I noticed that the ARM job sometimes are slower than others in the fetch_tarball phase, this might due to that our machine is in China and the network connection is a little bit slow, I just got the info that our machine will be available in Singapore latter this month or ealier next month, maybe then we can then provide a machine with faster network which can speed up the jobs.
Related to this, I wonder if it would be possible to install a newer operating system (or Docker image), such as CentOS 8 or Debian 10 or the most recent Fedora.

What prompts me to ask is that I just noticed a compilation failure of MariaDB 10.2 that might be addressed by upgrading to a newer compiler:

/buildbot/aarch64-centos-7/build/storage/innobase/row/row0log.cc: In function 'dberr_t _ZL17row_log_apply_opsPK5trx_tP12dict_index_tP15row_merge_dup_tP16ut_stage_alter_t.isra.94(const trx_t*, dict_index_t*, row_merge_dup_t*)':
/buildbot/aarch64-centos-7/build/storage/innobase/row/row0log.cc:3734:1: error: could not split insn

More information is available in https://buildbot.mariadb.org/#/builders/33/builds/260 (including a request to submit a compiler bug report).

I remember seeing that kind of an error for some 64-bit atomic operation on a very old GCC targeting x86 (on CentOS 5 maybe?). While we have older compilers than GCC 4.8.5 for other instruction set architectures, I do not think that we run into internal compiler errors very often.

On my AMD64 desktop, I am currently using GCC 9.2.1 and clang 9.0.0. As a developer, I prefer to have the most recent versions of tools whenever it is possible, for better diagnostics and possibly better optimizations.

Best regards,
Marko
--
Marko Mäkelä, Lead Developer InnoDB
MariaDB Corporation
Hi Kevin, On Wed, Dec 11, 2019 at 3:45 AM Zheng Zhenyu <zheng.zhenyu@outlook.com> wrote:
Hi Marko,
First off, sorry for the delay of reply of your previous question, I was on travel and I saw few people already replied so I didn't do it.
As for your suggestion, yes, it will be no problem, I think Vlad take cares of the Docker images. The resource are donated to the community, and the community can do whatever is required.
Thank you. I have asked Vlad to diagnose this problem. I am not completely aware how the Buildbot integration works and who is responsible for what, but now I have a slightly better idea of it.

Gordan Bobic: I apologize for my brain-fart. Obviously we cannot simply abandon less widely used platforms for stable release branches. If we did that, GNU/Linux distributions could hit the same problems and could be unable to update their packages for some architectures.

My personal preference would be to *additionally* have some bleeding-edge compilers and environments running on the continuous integration environment.

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB Corporation
Hi,

I finally have some updates regarding the internal compiler bug that Marko reported. I have submitted a bug report, https://bugzilla.redhat.com/show_bug.cgi?id=1788104, but it seems that the bug won't be fixed.

Cheers,
Vlad

On Wed, Dec 11, 2019 at 8:40 AM Marko Mäkelä <marko.makela@mariadb.com> wrote:
Hi Kevin,
On Wed, Dec 11, 2019 at 3:45 AM Zheng Zhenyu <zheng.zhenyu@outlook.com> wrote:
Hi Marko,
First off, sorry for the delay of reply of your previous question, I was on travel and I saw few people already replied so I didn't do it.
As for your suggestion, yes, it will be no problem, I think Vlad take cares of the Docker images. The resource are donated to the community, and the community can do whatever is required.
Thank you. I have asked Vlad to diagnose this problem. I am not completely aware how the Buildbot integration works and who is responsible for what, but now I have a slightly better idea of it.
Gordan Bobic: I apologize for my brain-fart. Obviously we cannot simply abandon less widely used platforms for stable release branches. If we did that, GNU/Linux distributions could hit the same problems and could be unable to update their packages for some architectures. My personal preference would be to *additionally* have some bleeding-edge compilers and environments running on the continuous integration environment. -- Marko Mäkelä, Lead Developer InnoDB MariaDB Corporation
-- Vlad
Hello Kevin,

It was my pleasure to meet you in Shanghai.

On my flight back, I worked on a micro-optimization, trying to make sure that native loads or stores are being used instead of memcpy(), memset(), memcmp(), when the data is known to be aligned. I filed a ticket for it:

https://jira.mariadb.org/browse/MDEV-21133 Optimize access to InnoDB page header fields

My colleague Eugene Kosov pointed out that such loads or stores are undefined behaviour (and cmake -DWITH_UBSAN=ON would likely agree). But, he showed that wrapping the arguments of <string.h> functions with __builtin_assume_aligned() actually works: https://godbolt.org/z/jCF_6q

Eugene also pointed out to some related work: http://open-std.org/JTC1/SC22/WG21/docs/papers/2019/p1774r1.pdf

I found a claim that Aarch64 does support unaligned access in practice: https://stackoverflow.com/questions/38535738/does-aarch64-support-unaligned-...

Can you provide a more authoritative answer? Is there some flag that should be passed to gcc or clang to enable it to generate simpler code?

I also found a claim that POWER8 supports unaligned access, and I seem to remember that the latest version of the SPARC introduced support for that as well. (IA-32 and AMD64 have always supported unaligned access, except for some SIMD operations.)

Last, I believe that we could get some performance benefits if include/byte_order_generic.h was rewritten in a suitable way. Ideally, include/byte_order_generic_x86_64.h would be replaced with a portable version of both, and compilers could simply perform the optimizations. I have been told that replacing the + in the macros with | could already be a good start. I would welcome patches in this area.

Related note: Maybe a year ago, I was positively surprised to learn that the InnoDB monster function mach_read_from_4() is being translated into a single 80486 BSWAP instruction, or an AMD64 MOVBE instruction.

With best regards,
Marko
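To illustrate the __builtin_assume_aligned() wrapping mentioned in this message, here is a minimal sketch. It is not the code from MDEV-21133 or from the linked Compiler Explorer page; the helper names load_u32_aligned() and store_u32_aligned() are made up for this example, and the builtin is assumed to be available (GCC or clang).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load a 32-bit value in host byte order.
   The caller promises that p is 4-byte aligned; memcpy() keeps the
   access well-defined, and the alignment hint lets the compiler
   choose a single aligned load where the target has one. */
static inline uint32_t load_u32_aligned(const void *p)
{
    uint32_t v;
    memcpy(&v, __builtin_assume_aligned(p, 4), sizeof v);
    return v;
}

/* Hypothetical helper: the matching store, with the same promise. */
static inline void store_u32_aligned(void *p, uint32_t v)
{
    memcpy(__builtin_assume_aligned(p, 4), &v, sizeof v);
}
```

Passing a pointer that is not actually 4-byte aligned breaks the promise made to the compiler, so helpers like these only make sense where the alignment is guaranteed by construction.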
On Mon, 25 Nov 2019 11:32:07 +0200 Marko Mäkelä <marko.makela@mariadb.com> wrote:
I also found a claim that POWER8 supports unaligned access,
This is correct (for the normal cacheable memory (i.e. not device IO mapped - so not applicable to mariadb))
and I seem to remember that the latest version of the SPARC introduced support for that as well. (IA-32 and AMD64 have always supported unaligned access, except for some SIMD operations.)
Last, I believe that we could get some performance benefits if include/byte_order_generic.h was rewritten in a suitable way. Ideally, include/byte_order_generic_x86_64.h would be replaced with a portable version of both, and compilers could simply perform the optimizations. I have been told that replacing the + in the macros with | could already be a good start. I would welcome patches in this area.
I've never managed to find the time to look at these; however, a non-aligned version for non-common arches seems a better way to model this.
Related note: Maybe a year ago, I was positively surprised to learn that the InnoDB monster function mach_read_from_4() is being translated into a single 80486 BSWAP instruction, or an AMD64 MOVBE instruction.
Yes, compilers are getting pretty good as are libc implementations of occasionally re-invented code (threads, mutexes, copy functions etc.). Daniel Black IBM Power systems
Hi Daniel, On Tue, Nov 26, 2019 at 2:02 AM Daniel Black <daniel@linux.ibm.com> wrote:
On Mon, 25 Nov 2019 11:32:07 +0200 Marko Mäkelä <marko.makela@mariadb.com> wrote:
I also found a claim that POWER8 supports unaligned access,
This is correct (for the normal cacheable memory (i.e. not device IO mapped - so not applicable to mariadb))
and I seem to remember that the latest version of the SPARC introduced support for that as well. (IA-32 and AMD64 have always supported unaligned access, except for some SIMD operations.)
Last, I believe that we could get some performance benefits if include/byte_order_generic.h was rewritten in a suitable way. Ideally, include/byte_order_generic_x86_64.h would be replaced with a portable version of both, and compilers could simply perform the optimizations. I have been told that replacing the + in the macros with | could already be a good start. I would welcome patches in this area.
I've never managed to get the time to look at these however a non-aligned version for non-common arches seems a better way to model this.
I pushed my micro-optimization to 10.5: https://github.com/MariaDB/server/commit/25e2a556de2e125784d52a0c7ccda4fa659...

If there really is no compiler flag that would allow any memcpy(), memset(), memcmp() of 2,4,8 bytes to be translated into simple (possibly unaligned) multi-byte instructions, then we might add further MY_ASSUME_ALIGNED() assertions here and there, to allow gcc and clang to generate better code for POWER and ARM.

If the compiler is smart enough, it might suffice to implement an accessor for buf_block_t or buf_block_t::frame that would MY_ASSUME_ALIGNED(frame, 4096). Then the compiler might correctly infer the alignment of (block->frame + some_compile_time_constant) and enable the optimization. I would be unwilling to pepper such hints all over the code.

Marko
--
Marko Mäkelä, Lead Developer InnoDB
MariaDB Corporation
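The accessor idea in this message could look roughly like the sketch below. Everything named here is an assumption made for illustration: struct block stands in for buf_block_t, ASSUME_ALIGNED() stands in for a macro such as MY_ASSUME_ALIGNED() and is modeled with the GCC/clang builtin, and HDR_FIELD_OFFSET is an arbitrary compile-time constant rather than a real InnoDB header offset.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a macro such as MY_ASSUME_ALIGNED();
   assumes GCC or clang. */
#define ASSUME_ALIGNED(ptr, align) __builtin_assume_aligned((ptr), (align))

/* Assumed values: 4096-byte frame alignment as in the message,
   and an arbitrary 4-byte-aligned field offset. */
enum { FRAME_ALIGN = 4096, HDR_FIELD_OFFSET = 8 };

/* Hypothetical stand-in for buf_block_t; only the frame matters here. */
struct block {
    unsigned char *frame;
};

/* Single accessor that carries the alignment promise for the frame. */
static inline const unsigned char *block_frame(const struct block *b)
{
    return ASSUME_ALIGNED(b->frame, FRAME_ALIGN);
}

/* Read a 4-byte field in host byte order at a constant offset.
   Because the hint is attached in block_frame(), the compiler may be
   able to infer the alignment of frame + HDR_FIELD_OFFSET by itself. */
static inline uint32_t read_hdr_field(const struct block *b)
{
    uint32_t v;
    memcpy(&v, block_frame(b) + HDR_FIELD_OFFSET, sizeof v);
    return v;
}
```

The point of the design is that the hint lives in one place; whether a given compiler actually propagates it through the pointer arithmetic would have to be checked in the generated code.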
On Tue, 26 Nov 2019 10:56:41 +0200 Marko Mäkelä <marko.makela@mariadb.com> wrote:
Hi Daniel,
On Tue, Nov 26, 2019 at 2:02 AM Daniel Black <daniel@linux.ibm.com> wrote:
On Mon, 25 Nov 2019 11:32:07 +0200 Marko Mäkelä <marko.makela@mariadb.com> wrote:
I also found a claim that POWER8 supports unaligned access,
This is correct (for the normal cacheable memory (i.e. not device IO mapped - so not applicable to mariadb))
and I seem to remember that the latest version of the SPARC introduced support for that as well. (IA-32 and AMD64 have always supported unaligned access, except for some SIMD operations.)
Last, I believe that we could get some performance benefits if include/byte_order_generic.h was rewritten in a suitable way. Ideally, include/byte_order_generic_x86_64.h would be replaced with a portable version of both, and compilers could simply perform the optimizations. I have been told that replacing the + in the macros with | could already be a good start. I would welcome patches in this area.
I've never managed to get the time to look at these however a non-aligned version for non-common arches seems a better way to model this.
I pushed my micro-optimization to 10.5: https://github.com/MariaDB/server/commit/25e2a556de2e125784d52a0c7ccda4fa659...
If there really is no compiler flag that would allow any memcpy(), memset(), memcmp()
Well, actually:
-fno-builtin-mem{cmp,set,cpy}
-mmem{set,cpy}-strategy= (seems x86 only)
of 2,4,8 bytes to be translated into simple (possibly unaligned) multi-byte instructions,
Actually gcc has put an effort into getting the optimum implementation here already. It doesn't look like a thing an end application should be trying to optimise.

$ rm -f memset_opt.o && gcc -O1 -fomit-frame-pointer -c memset_opt.c -o memset_opt.o && objdump -d memset_opt.o | grep -A 10 vmem
0000000000000000 <vmemset>:
   0: c7 07 00 00 00 00       movl $0x0,(%rdi)
   6: c3                      retq
0000000000000007 <vmemcmp>:
   7: 48 83 ec 18             sub $0x18,%rsp
   b: 89 7c 24 0c             mov %edi,0xc(%rsp)
   f: ba 04 00 00 00          mov $0x4,%edx
  14: 48 8d 74 24 0c          lea 0xc(%rsp),%rsi
  19: bf 00 00 00 00          mov $0x0,%edi
  1e: e8 00 00 00 00          callq 23 <vmemcmp+0x1c>
  23: 48 83 c4 18             add $0x18,%rsp
  27: c3                      retq
0000000000000028 <vmemstatic>:
  28: b8 ff ff ff ff          mov $0xffffffff,%eax
  2d: c3                      retq
000000000000002e <vmemcpy>:
  2e: 8b 05 00 00 00 00       mov 0x0(%rip),%eax # 34 <vmemcpy+0x6>
  34: 89 07                   mov %eax,(%rdi)
  36: c3                      retq

[dan@volution junk]$ cat memset_opt.c
#include <string.h>

static int comp = 7;
char r[30];

void vmemset(char v[30]) { memset(v, 0, 4); }
int vmemcmp(int c) { return memcmp(&comp, &c, sizeof(c)); }
int vmemstatic() { return memcmp("cat", "dog", 3); }
void vmemcpy(int *c) { memcpy(c, r, sizeof(*c)); }

Not sure why vmemcmp still has a memcmp call, but by vmemstatic some understanding is there.
script to test: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43052#c12

Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz (laptop)
glibc-2.29-22.fc30
gcc (GCC) 9.2.1 20190827 (Red Hat 9.2.1-1)

$ sh test_stringop 64 640000000 gcc -march=native

memcpy mode:64 size:640000000
                    libcall  rep1     noalg    rep4     noalg    rep8     noalg    loop     noalg    unrl     noalg    byte     profiled dynamic
block size 8192000  0:00.12  0:00.12  0:00.12  0:00.13  0:00.12  0:00.12  0:00.13  0:00.13  0:00.13  0:00.13  0:00.13  0:00.50  0:00.12  0:00.12  best: 0:00.12 libcall
block size 819200   0:00.08  0:00.10  0:00.10  0:00.10  0:00.10  0:00.10  0:00.10  0:00.09  0:00.09  0:00.09  0:00.09  0:00.48  0:00.08  0:00.08  best: 0:00.08 libcall
block size 81920    0:00.04  0:00.05  0:00.05  0:00.05  0:00.04  0:00.04  0:00.05  0:00.09  0:00.08  0:00.07  0:00.08  0:00.51  0:00.04  0:00.04  best: 0:00.04 libcall
block size 20480    0:00.04  0:00.04  0:00.04  0:00.04  0:00.04  0:00.04  0:00.04  0:00.07  0:00.11  0:00.08  0:00.08  0:00.86  0:00.03  0:00.04  best: 0:00.04 libcall
block size 8192     0:00.03  0:00.03  0:00.04  0:00.04  0:00.03  0:00.03  0:00.03  0:00.06  0:00.10  0:00.06  0:00.07  0:00.48  0:00.03  0:00.03  best: 0:00.03 libcall
block size 4096     0:00.03  0:00.03  0:00.03  0:00.03  0:00.03  0:00.03  0:00.03  0:00.06  0:00.10  0:00.06  0:00.07  0:00.47  0:00.03  0:00.03  best: 0:00.03 libcall
block size 2048     0:00.03  0:00.03  0:00.03  0:00.04  0:00.03  0:00.04  0:00.03  0:00.07  0:00.10  0:00.06  0:00.07  0:00.48  0:00.03  0:00.03  best: 0:00.03 libcall
block size 1024     0:00.04  0:00.04  0:00.04  0:00.05  0:00.04  0:00.05  0:00.04  0:00.08  0:00.11  0:00.07  0:00.07  0:00.49  0:00.03  0:00.04  best: 0:00.04 libcall
block size 512      0:00.05  0:00.06  0:00.06  0:00.06  0:00.05  0:00.06  0:00.05  0:00.09  0:00.12  0:00.07  0:00.07  0:00.50  0:00.09  0:00.06  best: 0:00.05 libcall
block size 256      0:00.07  0:00.08  0:00.08  0:00.09  0:00.08  0:00.09  0:00.08  0:00.10  0:00.12  0:00.09  0:00.09  0:00.52  0:00.10  0:00.09  best: 0:00.07 libcall
block size 128      0:00.11  0:00.13  0:00.13  0:00.15  0:00.13  0:00.15  0:00.13  0:00.14  0:00.14  0:00.12  0:00.11  0:00.56  0:00.12  0:00.12  best: 0:00.11 libcall
block size 64       0:00.20  0:00.20  0:00.20  0:00.24  0:00.22  0:00.24  0:00.22  0:00.19  0:00.20  0:00.18  0:00.19  0:00.75  0:00.18  0:00.18  best: 0:00.18 unrl
block size 48       0:00.25  0:00.28  0:00.28  0:00.31  0:00.31  0:00.31  0:00.29  0:00.23  0:00.22  0:00.22  0:00.23  0:00.66  0:00.22  0:00.22  best: 0:00.22 loopnoalign
block size 32       0:00.38  0:00.40  0:00.38  0:00.44  0:00.39  0:00.45  0:00.38  0:00.30  0:00.32  0:00.30  0:00.31  0:00.93  0:00.31  0:00.31  best: 0:00.30 loop
block size 24       0:00.51  0:00.57  0:00.56  0:00.63  0:00.58  0:00.64  0:00.56  0:00.42  0:00.40  0:00.36  0:00.37  0:00.78  0:00.37  0:00.36  best: 0:00.36 unrl
block size 16       0:00.75  0:00.74  0:00.74  0:00.84  0:00.74  0:00.85  0:00.70  0:00.48  0:00.47  0:00.40  0:00.40  0:00.87  0:00.48  0:00.47  best: 0:00.40 unrl
block size 14       0:00.76  0:00.95  0:00.97  0:01.01  0:00.99  0:01.01  0:00.89  0:00.52  0:00.51  0:00.49  0:00.49  0:00.88  0:00.54  0:00.53  best: 0:00.49 unrl
block size 12       0:00.93  0:01.10  0:01.10  0:01.14  0:01.05  0:01.19  0:00.98  0:00.64  0:00.61  0:00.56  0:00.57  0:00.79  0:00.64  0:00.59  best: 0:00.56 unrl
block size 10       0:01.04  0:01.31  0:01.31  0:01.37  0:01.23  0:01.41  0:01.14  0:00.75  0:00.74  0:00.68  0:00.65  0:00.86  0:00.68  0:00.69  best: 0:00.65 unrlnoalign
block size 8        0:01.36  0:01.59  0:01.55  0:01.68  0:01.37  0:01.64  0:01.18  0:00.79  0:00.79  0:00.73  0:00.73  0:00.89  0:00.81  0:00.78  best: 0:00.73 unrl
block size 6        0:01.66  0:02.25  0:02.23  0:02.31  0:02.01  0:02.31  0:01.57  0:01.01  0:00.96  0:00.99  0:01.01  0:01.02  0:01.00  0:01.01  best: 0:00.96 loopnoalign
block size 4        0:02.68  0:03.24  0:03.47  0:03.21  0:02.65  0:01.38  0:01.36  0:01.41  0:01.34  0:01.38  0:01.35  0:01.47  0:01.26  0:01.32  best: 0:01.34 loopnoalign
block size 1        0:05.41  0:17.59  0:17.41  0:01.52  0:01.51  0:01.46  0:01.50  0:01.59  0:01.49  0:01.56  0:01.52  0:02.43  0:02.39  0:02.42  best: 0:01.46 rep8

memset
                    libcall  rep1     noalg    rep4     noalg    rep8     noalg    loop     noalg    unrl     noalg    byte     profiled dynamic
block size 8192000  0:00.05  0:00.05  0:00.06  0:00.05  0:00.05  0:00.05  0:00.05  0:00.11  0:00.09  0:00.09  0:00.11  0:00.47  0:00.05  0:00.05  best: 0:00.05 libcall
block size 819200   0:00.05  0:00.05  0:00.05  0:00.05  0:00.05  0:00.05  0:00.05  0:00.10  0:00.08  0:00.07  0:00.08  0:00.45  0:00.06  0:00.05  best: 0:00.05 libcall
block size 81920    0:00.03  0:00.03  0:00.03  0:00.04  0:00.04  0:00.03  0:00.03  0:00.10  0:00.07  0:00.06  0:00.07  0:00.47  0:00.04  0:00.03  best: 0:00.03 libcall
block size 20480    0:00.03  0:00.03  0:00.03  0:00.03  0:00.03  0:00.03  0:00.03  0:00.08  0:00.11  0:00.07  0:00.08  0:00.51  0:00.03  0:00.03  best: 0:00.03 libcall
block size 8192     0:00.03  0:00.03  0:00.03  0:00.03  0:00.03  0:00.03  0:00.03  0:00.09  0:00.06  0:00.06  0:00.06  0:00.45  0:00.03  0:00.03  best: 0:00.03 libcall
block size 4096     0:00.03  0:00.03  0:00.03  0:00.03  0:00.03  0:00.03  0:00.04  0:00.10  0:00.06  0:00.06  0:00.06  0:00.41  0:00.03  0:00.03  best: 0:00.03 libcall
block size 2048     0:00.04  0:00.03  0:00.04  0:00.04  0:00.03  0:00.04  0:00.03  0:00.10  0:00.06  0:00.06  0:00.06  0:00.41  0:00.04  0:00.04  best: 0:00.03 rep1
block size 1024     0:00.05  0:00.04  0:00.05  0:00.05  0:00.04  0:00.05  0:00.04  0:00.10  0:00.07  0:00.07  0:00.07  0:00.41  0:00.05  0:00.05  best: 0:00.04 rep1
block size 512      0:00.07  0:00.06  0:00.06  0:00.07  0:00.06  0:00.07  0:00.06  0:00.11  0:00.07  0:00.08  0:00.07  0:00.42  0:00.07  0:00.07  best: 0:00.06 rep1
block size 256      0:00.11  0:00.08  0:00.08  0:00.10  0:00.08  0:00.10  0:00.08  0:00.12  0:00.09  0:00.09  0:00.09  0:00.44  0:00.10  0:00.11  best: 0:00.08 rep1
block size 128      0:00.15  0:00.13  0:00.13  0:00.15  0:00.13  0:00.15  0:00.12  0:00.15  0:00.12  0:00.13  0:00.13  0:00.50  0:00.14  0:00.14  best: 0:00.12 loopnoalign
block size 64       0:00.28  0:00.21  0:00.22  0:00.25  0:00.23  0:00.23  0:00.22  0:00.20  0:00.20  0:00.20  0:00.21  0:00.50  0:00.20  0:00.20  best: 0:00.20 loop
block size 48       0:00.31  0:00.27  0:00.27  0:00.30  0:00.30  0:00.29  0:00.28  0:00.24  0:00.23  0:00.24  0:00.25  0:00.59  0:00.24  0:00.24  best: 0:00.23 loopnoalign
block size 32       0:00.47  0:00.36  0:00.36  0:00.40  0:00.37  0:00.40  0:00.37  0:00.30  0:00.31  0:00.31  0:00.30  0:00.58  0:00.31  0:00.31  best: 0:00.30 loop
block size 24       0:00.62  0:00.55  0:00.55  0:00.59  0:00.56  0:00.55  0:00.52  0:00.35  0:00.35  0:00.35  0:00.36  0:00.66  0:00.35  0:00.34  best: 0:00.35 loop
block size 16       0:00.92  0:00.78  0:00.72  0:00.76  0:00.70  0:00.71  0:00.63  0:00.40  0:00.40  0:00.33  0:00.34  0:00.67  0:00.39  0:00.40  best: 0:00.33 unrl
block size 14       0:00.98  0:00.94  0:00.95  0:00.95  0:00.91  0:00.90  0:00.85  0:00.43  0:00.43  0:00.39  0:00.39  0:00.68  0:00.43  0:00.43  best: 0:00.39 unrl
block size 12       0:01.16  0:01.11  0:01.10  0:01.09  0:01.03  0:01.01  0:00.87  0:00.43  0:00.46  0:00.44  0:00.43  0:00.72  0:00.46  0:00.45  best: 0:00.43 loop
block size 10       0:01.39  0:01.33  0:01.33  0:01.29  0:01.21  0:01.17  0:00.99  0:00.49  0:00.51  0:00.50  0:00.55  0:00.84  0:00.58  0:00.57  best: 0:00.49 loop
block size 8        0:01.87  0:01.51  0:01.47  0:01.43  0:01.26  0:01.27  0:00.96  0:00.57  0:00.56  0:00.52  0:00.51  0:00.83  0:00.56  0:00.55  best: 0:00.51 unrlnoalign
block size 6        0:02.17  0:02.26  0:02.29  0:01.99  0:01.80  0:01.56  0:01.27  0:00.70  0:00.70  0:00.74  0:00.72  0:00.92  0:00.74  0:00.71  best: 0:00.70 loop
block size 4        0:03.16  0:03.16  0:03.11  0:02.47  0:02.04  0:01.02  0:00.95  0:00.92  0:00.93  0:00.91  0:00.93  0:01.09  0:00.93  0:01.08  best: 0:00.91 unrl
block size 1        0:04.64  0:17.11  0:18.85  0:01.78  0:01.79  0:01.77  0:01.76  0:01.74  0:01.79  0:01.70  0:01.68  0:02.05  0:01.27  0:02.27  best: 0:01.68 unrlnoalign

For non-x86 I modified the above script (attached) to run the memX and compare to it with {-fno-builtin-X}

root@ozrom2:~# sh test_stringop 64 640000000 gcc -mcpu=power9 | tee out.txt
root@ozrom2:~# gcc --version
gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0

memcpy mode:64 size:640000000
                    libcall    nobuiltin  byte       profiled
block size 8192000  0:00.04    0:00.04    0:00.04    best: 0:00.04 libcall
block size 819200   0:00.04    0:00.03    0:00.04    best: 0:00.03 nobuiltin
block size 81920    0:00.03    0:00.03    0:00.03    best: 0:00.03 libcall
block size 20480    0:00.03    0:00.03    0:00.04    best: 0:00.03 libcall
block size 8192     0:00.03    0:00.04    0:00.04    best: 0:00.03 libcall
block size 4096     0:00.04    0:00.04    0:00.04    best: 0:00.04 libcall
block size 2048     0:00.05    0:00.05    0:00.05    best: 0:00.05 libcall
block size 1024     0:00.07    0:00.07    0:00.06    best: 0:00.07 libcall
block size 512      0:00.09    0:00.09    0:00.10    best: 0:00.09 libcall
block size 256      0:00.12    0:00.12    0:00.11    best: 0:00.12 libcall
block size 128      0:00.19    0:00.20    0:00.19    best: 0:00.19 libcall
block size 64       0:00.32    0:00.32    0:00.31    best: 0:00.32 libcall
block size 48       0:00.45    0:00.44    0:00.46    best: 0:00.44 nobuiltin
block size 32       0:00.58    0:00.59    0:00.58    best: 0:00.58 libcall
block size 24       0:00.82    0:00.83    0:00.81    best: 0:00.82 libcall
block size 16       0:01.09    0:01.10    0:01.05    best: 0:01.09 libcall
block size 14       0:01.36    0:01.36    0:01.32    best: 0:01.36 libcall
block size 12       0:01.58    0:01.57    0:01.56    best: 0:01.57 nobuiltin
block size 10       0:01.88    0:01.88    0:01.84    best: 0:01.88 libcall
block size 8        0:02.14    0:02.14    0:02.03    best: 0:02.14 libcall
block size 6        0:03.24    0:03.24    0:03.01    best: 0:03.24 libcall
block size 4        0:04.27    0:04.26    0:03.90    best: 0:04.26 nobuiltin
block size 1        0:18.42    0:18.45    0:15.29    best: 0:18.42 libcall

memset
                    libcall    nobuiltin  byte       profiled
block size 8192000  0:00.04    0:00.04    0:00.04    best: 0:00.04 libcall
block size 819200   0:00.04    0:00.03    0:00.04    best: 0:00.03 nobuiltin
block size 81920    0:00.04    0:00.03    0:00.03    best: 0:00.03 nobuiltin
block size 20480    0:00.04    0:00.04    0:00.03    best: 0:00.04 libcall
block size 8192     0:00.03    0:00.04    0:00.03    best: 0:00.03 libcall
block size 4096     0:00.04    0:00.04    0:00.04    best: 0:00.04 libcall
block size 2048     0:00.05    0:00.05    0:00.05    best: 0:00.05 libcall
block size 1024     0:00.07    0:00.07    0:00.06    best: 0:00.07 libcall
block size 512      0:00.09    0:00.09    0:00.10    best: 0:00.09 libcall
block size 256      0:00.13    0:00.12    0:00.11    best: 0:00.12 nobuiltin
block size 128      0:00.19    0:00.18    0:00.19    best: 0:00.18 nobuiltin
block size 64       0:00.31    0:00.32    0:00.31    best: 0:00.31 libcall
block size 48       0:00.44    0:00.45    0:00.45    best: 0:00.44 libcall
block size 32       0:00.58    0:00.58    0:00.58    best: 0:00.58 libcall
block size 24       0:00.82    0:00.82    0:00.81    best: 0:00.82 libcall
block size 16       0:01.09    0:01.09    0:01.05    best: 0:01.09 libcall
block size 14       0:01.36    0:01.36    0:01.32    best: 0:01.36 libcall
block size 12       0:01.57    0:01.58    0:01.55    best: 0:01.57 libcall
block size 10       0:01.90    0:01.90    0:01.83    best: 0:01.90 libcall
block size 8        0:02.14    0:02.15    0:02.05    best: 0:02.14 libcall
block size 6        0:03.20    0:03.20    0:03.03    best: 0:03.20 libcall
block size 4        0:04.26    0:04.27    0:03.91    best: 0:04.26 libcall
block size 1        0:18.43    0:18.44    0:15.30    best: 0:18.43 libcall

So it is pretty much better or identical to use plain memcpy/memset in all cases; the ones showing up as nobuiltin are pretty much in the noise of measurement.
then we might add further MY_ASSUME_ALIGNED() assertions here and there, to allow gcc and clang to generate better code for POWER and ARM.
If the compiler is smart enough, it might suffice to implement an accessor for buf_block_t or buf_block_t::frame that would MY_ASSUME_ALIGNED(frame, 4096). Then the compiler might correctly infer the alignment of (block->frame + some_compile_time_constant) and enable the optimization. I would be unwilling to pepper such hints all over the code.
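A minimal sketch of the accessor idea described above, assuming GCC or clang. `ASSUME_ALIGNED_SKETCH`, `block_sketch` and `read_be32_at` are hypothetical names used only for illustration: the macro stands in for MY_ASSUME_ALIGNED() (whose definition is not shown in this thread), and the struct is a drastically simplified placeholder for buf_block_t. Whether a given compiler actually propagates the hint through `frame + constant` would still need to be checked in the generated code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for MY_ASSUME_ALIGNED(); assumes the GCC/clang
// builtin __builtin_assume_aligned. Other compilers get no hint at all.
#if defined(__GNUC__) || defined(__clang__)
# define ASSUME_ALIGNED_SKETCH(ptr, n) __builtin_assume_aligned((ptr), (n))
#else
# define ASSUME_ALIGNED_SKETCH(ptr, n) (ptr)
#endif

// Simplified, hypothetical block descriptor (not the real buf_block_t).
struct block_sketch {
  unsigned char *frame;  // must point to a 4096-byte aligned page

  // The single place where the alignment promise is made.
  const unsigned char *aligned_frame() const {
    // The hint is undefined behavior if it is false, so verify it in debug builds.
    assert(reinterpret_cast<std::uintptr_t>(frame) % 4096 == 0);
    return static_cast<const unsigned char *>(
        ASSUME_ALIGNED_SKETCH(frame, 4096));
  }
};

// Reads a 32-bit big-endian field at a compile-time constant offset.
// The byte-wise form is valid for any alignment; with the hint, the
// compiler is free to merge it into one aligned load if it can prove
// that (frame + Offset) is suitably aligned.
template <std::size_t Offset>
inline std::uint32_t read_be32_at(const block_sketch &block) {
  const unsigned char *p = block.aligned_frame() + Offset;
  return (std::uint32_t{p[0]} << 24) | (std::uint32_t{p[1]} << 16) |
         (std::uint32_t{p[2]} << 8) | std::uint32_t{p[3]};
}
```

Keeping the hint inside one accessor means callers never repeat it, which is the point of not peppering such hints all over the code.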
Marko
Hi Daniel, You seem to be right that the compilers are already mostly doing the right thing. Here is a notable exception where GCC lags behind clang (unnecessary use of stack): https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89804 I created another test program, checking how mach_read_from_4() gets compiled. It turns out that on Aarch64 and POWER, unaligned reads are being used by default: https://godbolt.org/z/ZcavM4 For 32-bit ARM, -march=armv6 seems to enable unaligned reads. For RISC-V and WebAssembly, the code is rather ugly. :-) So, indeed, there does not appear to be much to micro-optimize here. Marko -- Marko Mäkelä, Lead Developer InnoDB MariaDB Corporation
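For readers who have not seen the function, the following is a sketch of the kind of code such a test program might compare; it is not the actual test program behind the godbolt link, and it is not the InnoDB source. mach_read_from_4() reads a 4-byte big-endian integer from a byte pointer. The helper names below are hypothetical, and the second variant assumes the GCC/clang predefined byte-order macros and `__builtin_bswap32`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if !defined(__BYTE_ORDER__)
# error "this sketch assumes the GCC/clang predefined byte-order macros"
#endif

// Hypothetical helper: byte-wise big-endian load, valid for any alignment.
// Compilers that know unaligned loads are cheap on the target may merge
// these four byte loads into a single 32-bit load plus a byte swap.
inline std::uint32_t read_be32_bytewise(const unsigned char *p) {
  return (std::uint32_t{p[0]} << 24) | (std::uint32_t{p[1]} << 16) |
         (std::uint32_t{p[2]} << 8) | std::uint32_t{p[3]};
}

// Hypothetical helper: the same load written as memcpy plus byte swap.
// memcpy makes no alignment assumption, so this is also safe for any p.
inline std::uint32_t read_be32_memcpy(const unsigned char *p) {
  std::uint32_t v;
  std::memcpy(&v, p, sizeof v);
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  v = __builtin_bswap32(v);  // stored value is big-endian; swap on LE hosts
#endif
  return v;
}

// Usage example: both variants must agree at any address, aligned or not.
inline void check_loads_agree(const unsigned char *p) {
  assert(read_be32_bytewise(p) == read_be32_memcpy(p));
}
```

Comparing the assembly of the two variants for each target (for example with `-O2` on a compiler explorer) is one way to see whether the compiler already emits a single unaligned load.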
For what it's worth, unaligned access does come with a performance penalty, typically somewhere in the 1-10% range on x86, depending on the generation of chip used. It has been _mostly_ mitigated on recent x86 chips, and IIRC Intel's C compiler does have an option to align all structs and arrays to a 16 byte boundary. I would be very interested to see some test data on unaligned access cost on various aarch64 chips. On various 32-bit ARM chips (including those >= ARMv6) the unaligned access performance hit was quite dramatic. On Wed, Nov 27, 2019 at 11:36 AM Marko Mäkelä <marko.makela@mariadb.com> wrote:
Hi Daniel,
You seem to be right that the compilers are already mostly doing the right thing. Here is a notable exception where GCC lags behind clang (unnecessary use of stack): https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89804
I created another test program, checking how mach_read_from_4() gets compiled. It turns out that on Aarch64 and POWER, unaligned reads are being used by default: https://godbolt.org/z/ZcavM4
For 32-bit ARM, -march=armv6 seems to enable unaligned reads. For RISC-V and WebAssembly, the code is rather ugly. :-)
So, indeed, there does not appear to be much to micro-optimize here.
Marko -- Marko Mäkelä, Lead Developer InnoDB MariaDB Corporation
On Thu, Nov 28, 2019 at 3:24 PM Gordan Bobic <gordan.bobic@gmail.com> wrote:
For what it's worth, unaligned access does come with a performance penalty, typically somewhere in the 1-10% range on x86, depending on the generation of chip used. It has been _mostly_ mitigated on recent x86 chips, and IIRC Intel's C compiler does have an option to align all structs and arrays to a 16 byte boundary.
Yes, there is overhead, and there are some unfortunate design choices (or problems) with the InnoDB page format. Luckily, most page header and footer fields are reasonably aligned.
I would be very interested to see some test data on unaligned access cost on various aarch64 chips. On various 32-bit ARM chips (including those >= ARMv6) the unaligned access performance hit was quite dramatic.
I wonder if the unaligned access could ever end up costing more than the instruction decoding overhead for implementing multi-byte access via single-byte operations. (In the past, when unaligned access could have been supported by an interrupt to the operating system, like Digital UNIX on the Alpha, I could easily believe it. But, now we are talking about hardware-supported unaligned access.) Marko -- Marko Mäkelä, Lead Developer InnoDB MariaDB Corporation
On Fri, Nov 29, 2019 at 9:22 AM Marko Mäkelä <marko.makela@mariadb.com> wrote:
On Thu, Nov 28, 2019 at 3:24 PM Gordan Bobic <gordan.bobic@gmail.com> wrote:
I would be very interested to see some test data on unaligned access cost on various aarch64 chips. On various 32-bit ARM chips (including those >= ARMv6) the unaligned access performance hit was quite dramatic.
I wonder if the unaligned access could ever end up costing more than the instruction decoding overhead for implementing multi-byte access via single-byte operations. (In the past, when unaligned access could have been supported by an interrupt to the operating system, like Digital UNIX on the Alpha, I could easily believe it. But, now we are talking about hardware-supported unaligned access.)
Last time I measured it, the difference was somewhere in the region of 20x slower on ARMv5 (between auto-alignment fixup in the kernel enabled and disabled for code that does unaligned access). Obviously, the code that does unaligned access with the auto-fixup disabled would just read/write garbage, with tragic consequences in some cases.
One of the reasons I stopped using ext4, for example, is that when I started working on ARM32, I discovered that fsck.ext4 is guilty of loading fs blocks into char[4096], and being char, this array is only byte aligned. Unfortunately, it would then go on to cast this into a struct with a bigger alignment requirement. The rest you can probably imagine.
Most developers are not even aware that this kind of problem exists, because they have only ever written code that runs on platforms that have transparent alignment fixup, like x86, so the worst case scenario is that it runs slower rather than resulting in outright data corruption.
IIRC the Intel compiler's 16-byte align option effectively makes every array definition happen as if it were pragma aligned to 16 bytes explicitly, thus avoiding the problem. Of course, that doesn't help on any platform other than x86, and there its main purpose is optimizing auto-vectorization of loops that operate on such arrays.
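The char[4096] pitfall described above can be sketched as follows. This is an illustration only: `header_sketch` is a made-up two-field header, not the real ext4 on-disk layout, and `load_header` and `is_aligned_for` are hypothetical helper names. The risky cast appears only in a comment; the code shows the usual portable alternative of copying the bytes out with memcpy.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical on-disk header; NOT the real ext4 superblock layout.
struct header_sketch {
  std::uint32_t magic;
  std::uint32_t block_count;
};

// Risky pattern (shown only as a comment):
//   char buf[4096];                                   // alignment 1
//   auto *h = reinterpret_cast<header_sketch *>(buf + off);
//   use(h->magic);  // undefined behavior if (buf + off) is misaligned
// On x86 this usually "just works"; on a CPU without alignment fixup
// it can trap or read garbage.

// Portable alternative: copy the bytes into a properly aligned object.
// memcpy has no alignment requirement on its source pointer.
inline header_sketch load_header(const char *raw) {
  header_sketch h;
  std::memcpy(&h, raw, sizeof h);
  return h;
}

// Helper to check whether a pointer satisfies T's alignment requirement.
template <typename T>
inline bool is_aligned_for(const void *p) {
  return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}
```

Declaring the buffer as `alignas(16) char buf[4096];` has roughly the effect attributed above to the Intel option, but it only guarantees alignment at the start of the buffer; a structure located at an arbitrary offset inside it can still be misaligned, which is why the copy is the safer default.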
participants (5)
- Daniel Black
- Gordan Bobic
- Marko Mäkelä
- Vlad Bogolin
- Zheng Zhenyu