[Maria-discuss] Some questions about the Aarch64 CI
Hi MariaDB,
Thanks for the great unconference in Shanghai this week; it was a really useful event for a MariaDB newbie like me. We mentioned that we are willing to donate some ARM resources to the foundation for ARM testing and other purposes in the future. I donated one machine for a POC on 20th Nov, and I can now see a new builder (worker) on https://buildbot.mariadb.org/#/builders named ``aarch-fedora-30``, with the version 2019.11.20, so I guess this is running on the server that I donated? It seems to have run successfully for a few rounds.
So I have some questions about the Aarch64 CI:
1. The jobs are running through docker, so jobs for different OSes could be run on this host, right? Is it possible to also enable a CentOS7 job now?
2. As I mentioned in the unconference, we will have an open source OS released soon, and we are looking for possibilities to have it tested upstream as well. If the answer to the first question is yes, we will have to prepare a base docker image for our OS, right?
3. I saw that there is a sponsors page for buildbot: https://buildbot.mariadb.org/#/sponsor , are we able to be on that page too?
BR,
Kevin Zheng
Hi Kevin,
First of all, thank you very much for your support and generous donation! You assumed correctly: the ``aarch-fedora-30`` builder is running on the new machine that you donated. However, I have since renamed the builder, so the new name is ``aarch64-fedora-30``. I have also added a new CentOS7 builder, namely ``aarch64-centos-7``. You can see both of them on the builders page https://buildbot.mariadb.org/#/builders. Now, to answer your questions:
1. So the jobs are running through docker, so jobs for different OS could be run on this host, right? Is it possible to also enable a CentOS7 job now?
Yes, you are right. We use docker to run the jobs and we can add more builders with different OSes. The new CentOS7 builder is up and running: https://buildbot.mariadb.org/#/builders/33.
2. As I mentioned in the unconference, we will have an OpenSource OS released soon, and we are looking for possibilities to also make it tested in the upstream, If the answer to the first question is True, we will have to prepare a base docker image for our OS, right?
The easiest setup would be to have a docker image of the OS. However, if that is not possible, we can discuss and come up with other potential solutions. So, let us know when the OS is released.
3. I saw that there is a sponsors site for buildbot: https://buildbot.mariadb.org/#/sponsor , are we able to be on that page too?
I have updated the sponsors page. However, if you have any suggestions or other requests regarding the sponsor page, let us know so that we can update it accordingly.
Cheers,
Vlad
Hi Vlad,
Thanks a lot for your quick response, the results look cool. Also, I noticed that the ARM jobs are sometimes slower than others in the fetch_tarball phase. This might be because our machine is in China and the network connection is a little bit slow. I just got the info that our machines will be available in Singapore later this month or early next month, so maybe then we can provide a machine with a faster network, which can speed up the jobs.
BR,
________________________________
From: Vlad Bogolin
Hi Kevin,
On Mon, Nov 25, 2019 at 8:23 AM Zheng Zhenyu wrote:
Thanks a lot for your quick response, the results look cool. Also, I noticed that the ARM jobs are sometimes slower than others in the fetch_tarball phase. This might be because our machine is in China and the network connection is a little bit slow. I just got the info that our machines will be available in Singapore later this month or early next month, so maybe then we can provide a machine with a faster network, which can speed up the jobs.
Related to this, I wonder if it would be possible to install a newer operating system (or Docker image), such as CentOS 8 or Debian 10 or the most recent Fedora.
What prompts me to ask is that I just noticed a compilation failure of MariaDB 10.2 that might be addressed by upgrading to a newer compiler:
/buildbot/aarch64-centos-7/build/storage/innobase/row/row0log.cc: In function 'dberr_t _ZL17row_log_apply_opsPK5trx_tP12dict_index_tP15row_merge_dup_tP16ut_stage_alter_t.isra.94(const trx_t*, dict_index_t*, row_merge_dup_t*)':
/buildbot/aarch64-centos-7/build/storage/innobase/row/row0log.cc:3734:1: error: could not split insn
More information is available in https://buildbot.mariadb.org/#/builders/33/builds/260 (including a request to submit a compiler bug report).
I remember seeing that kind of an error for some 64-bit atomic operation on a very old GCC targeting x86 (on CentOS 5 maybe?). While we have older compilers than GCC 4.8.5 for other instruction set architectures, I do not think that we run into internal compiler errors very often.
On my AMD64 desktop, I am currently using GCC 9.2.1 and clang 9.0.0. As a developer, I prefer to have the most recent versions of tools whenever it is possible, for better diagnostics and possibly better optimizations.
Best regards,
Marko
-- Marko Mäkelä, Lead Developer InnoDB MariaDB Corporation
I thought EL7 is very much supported with aarch64. If that is the case
won't what you are suggesting effectively abandon EL7?
_______________________________________________ Mailing list: https://launchpad.net/~maria-discuss Post to : maria-discuss@lists.launchpad.net Unsubscribe : https://launchpad.net/~maria-discuss More help : https://help.launchpad.net/ListHelp
On Tue, 10 Dec 2019 21:54:37 +0000 Gordan Bobic wrote:
I thought EL7 is very much supported with aarch64. If that is the case won't what you are suggesting effectively abandon EL7?
I suggest that, to help Red Hat with ARM support, you:
a) search https://bugzilla.redhat.com/ for anything similar
b) reproduce on RHEL 7 for ARM:
   use make VERBOSE=1 to extract the exact command line (including the change of directory)
   add -save-temps to the C++ flags
   include the generated .ii preprocessed file in the bug report/support ticket.
Hi Marko,
First off, sorry for the delayed reply to your previous question; I was travelling and saw that a few people had already replied, so I didn't.
As for your suggestion, yes, it will be no problem. I think Vlad takes care of the Docker images. The resources are donated to the community, and the community can do whatever is required.
Thanks,
Kevin Zheng
Hi Kevin,
On Wed, Dec 11, 2019 at 3:45 AM Zheng Zhenyu wrote:
Hi Marko,
First off, sorry for the delayed reply to your previous question; I was travelling and saw that a few people had already replied, so I didn't.
As for your suggestion, yes, it will be no problem. I think Vlad takes care of the Docker images. The resources are donated to the community, and the community can do whatever is required.
Thank you. I have asked Vlad to diagnose this problem. I am not completely aware how the Buildbot integration works and who is responsible for what, but now I have a slightly better idea of it.
Gordan Bobic: I apologize for my brain-fart. Obviously we cannot simply abandon less widely used platforms for stable release branches. If we did that, GNU/Linux distributions could hit the same problems and could be unable to update their packages for some architectures. My personal preference would be to *additionally* have some bleeding-edge compilers and environments running on the continuous integration environment.
-- Marko Mäkelä, Lead Developer InnoDB MariaDB Corporation
Hi,
I finally have some updates regarding the internal compiler bug that Marko
reported. I have submitted a bug report,
https://bugzilla.redhat.com/show_bug.cgi?id=1788104, but it seems that the
bug won't be fixed.
Cheers,
Vlad
Hello Kevin,
It was my pleasure to meet you in Shanghai.
On my flight back, I worked on a micro-optimization, trying to make
sure that native loads or stores are being used instead of memcpy(),
memset(), memcmp(), when the data is known to be aligned. I filed a
ticket for it:
https://jira.mariadb.org/browse/MDEV-21133 Optimize access to InnoDB
page header fields
My colleague Eugene Kosov pointed out that such loads or stores are
undefined behaviour (and cmake -DWITH_UBSAN=ON would likely agree).
But, he showed that wrapping the arguments of
On Mon, 25 Nov 2019 11:32:07 +0200 Marko Mäkelä wrote:
I also found a claim that POWER8 supports unaligned access,
This is correct (for the normal cacheable memory (i.e. not device IO mapped - so not applicable to mariadb))
and I seem to remember that the latest version of the SPARC introduced support for that as well. (IA-32 and AMD64 have always supported unaligned access, except for some SIMD operations.)
Last, I believe that we could get some performance benefits if include/byte_order_generic.h was rewritten in a suitable way. Ideally, include/byte_order_generic_x86_64.h would be replaced with a portable version of both, and compilers could simply perform the optimizations. I have been told that replacing the + in the macros with | could already be a good start. I would welcome patches in this area.
I've never managed to find the time to look at these; however, a non-aligned version for non-common arches seems a better way to model this.
Related note: Maybe a year ago, I was positively surprised to learn that the InnoDB monster function mach_read_from_4() is being translated into a single 80486 BSWAP instruction, or an AMD64 MOVBE instruction.
Yes, compilers are getting pretty good as are libc implementations of occasionally re-invented code (threads, mutexes, copy functions etc.). Daniel Black IBM Power systems
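To illustrate the suggestion quoted above about replacing the + in the byte-order macros with |, here is a minimal sketch of a portable little-endian 4-byte read written both ways. The function names are hypothetical and this is not the contents of include/byte_order_generic.h; it only shows that the two forms compute the same value, because the shifted bytes occupy disjoint bit ranges.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a generic little-endian 4-byte read:
   the bytes are combined with | instead of +. */
static inline uint32_t le_read_u32_or(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* The same read expressed with +, for comparison. */
static inline uint32_t le_read_u32_add(const unsigned char *p)
{
    return (uint32_t)p[0]
         + ((uint32_t)p[1] << 8)
         + ((uint32_t)p[2] << 16)
         + ((uint32_t)p[3] << 24);
}

/* Sanity check: both forms must agree on the same four bytes. */
static inline void le_read_check(const unsigned char *p)
{
    assert(le_read_u32_or(p) == le_read_u32_add(p));
}
```

Whether a given compiler recognizes one form more readily than the other as a plain load is something to check in the generated assembly rather than assume.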
Hi Daniel,
On Tue, Nov 26, 2019 at 2:02 AM Daniel Black
On Mon, 25 Nov 2019 11:32:07 +0200 Marko Mäkelä
wrote: I also found a claim that POWER8 supports unaligned access,
This is correct (for the normal cacheable memory (i.e. not device IO mapped - so not applicable to mariadb))
and I seem to remember that the latest version of the SPARC introduced support for that as well. (IA-32 and AMD64 have always supported unaligned access, except for some SIMD operations.)
Last, I believe that we could get some performance benefits if include/byte_order_generic.h was rewritten in a suitable way. Ideally, include/byte_order_generic_x86_64.h would be replaced with a portable version of both, and compilers could simply perform the optimizations. I have been told that replacing the + in the macros with | could already be a good start. I would welcome patches in this area.
I've never managed to get the time to look at these however a non-aligned version for non-common arches seems a better way to model this.
I pushed my micro-optimization to 10.5: https://github.com/MariaDB/server/commit/25e2a556de2e125784d52a0c7ccda4fa659...
If there really is no compiler flag that would allow any memcpy(), memset(), memcmp() of 2,4,8 bytes to be translated into simple (possibly unaligned) multi-byte instructions, then we might add further MY_ASSUME_ALIGNED() assertions here and there, to allow gcc and clang to generate better code for POWER and ARM.
If the compiler is smart enough, it might suffice to implement an accessor for buf_block_t or buf_block_t::frame that would MY_ASSUME_ALIGNED(frame, 4096). Then the compiler might correctly infer the alignment of (block->frame + some_compile_time_constant) and enable the optimization. I would be unwilling to pepper such hints all over the code.
Marko
-- Marko Mäkelä, Lead Developer InnoDB MariaDB Corporation
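The accessor idea can be sketched roughly as follows, assuming that a macro such as MY_ASSUME_ALIGNED() maps onto the GCC/clang builtin __builtin_assume_aligned(). The buffer, the function names and the field offset below are all hypothetical and are not taken from the InnoDB source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical page buffer, aligned to 4096 bytes like a buffer pool frame. */
static _Alignas(4096) unsigned char demo_frame[4096];

/* Hypothetical accessor: tells GCC/clang that the frame is 4096-byte aligned. */
static inline unsigned char *frame_aligned(unsigned char *frame)
{
    /* The hint is only valid if the pointer really is aligned. */
    assert(((uintptr_t)frame & 4095) == 0);
    return (unsigned char *)__builtin_assume_aligned(frame, 4096);
}

/* Read a 4-byte field at a compile-time-constant, 4-byte-aligned offset. */
static inline uint32_t read_field_u32(unsigned char *frame)
{
    enum { FIELD_OFFSET = 8 };  /* illustrative offset, not an InnoDB constant */
    uint32_t v;
    memcpy(&v, frame_aligned(frame) + FIELD_OFFSET, sizeof v);
    return v;  /* value in host byte order */
}
```

With the hint in one place, the compiler is free to infer the alignment of frame plus a constant offset; whether it actually does so would need to be confirmed in the generated code for each target.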
On Tue, 26 Nov 2019 10:56:41 +0200 Marko Mäkelä wrote:
Hi Daniel,
On Tue, Nov 26, 2019 at 2:02 AM Daniel Black
wrote: On Mon, 25 Nov 2019 11:32:07 +0200 Marko Mäkelä
wrote: I also found a claim that POWER8 supports unaligned access,
This is correct (for the normal cacheable memory (i.e. not device IO mapped - so not applicable to mariadb))
and I seem to remember that the latest version of the SPARC introduced support for that as well. (IA-32 and AMD64 have always supported unaligned access, except for some SIMD operations.)
Last, I believe that we could get some performance benefits if include/byte_order_generic.h was rewritten in a suitable way. Ideally, include/byte_order_generic_x86_64.h would be replaced with a portable version of both, and compilers could simply perform the optimizations. I have been told that replacing the + in the macros with | could already be a good start. I would welcome patches in this area.
I've never managed to get the time to look at these however a non-aligned version for non-common arches seems a better way to model this.
I pushed my micro-optimization to 10.5: https://github.com/MariaDB/server/commit/25e2a556de2e125784d52a0c7ccda4fa659...
If there really is no compiler flag that would allow any memcpy(), memset(), memcmp()
Well, actually: -fno-builtin-mem{cmp,set,cpy} -mmem{set,cpy}-strategy= (seems x86 only)
of 2,4,8 bytes to be translated into simple (possibly unaligned) multi-byte instructions,
Actually gcc has put an effort into getting the optimum implementation here already. It doesn't look like a thing an end application should be trying to optimise.
$ rm -f memset_opt.o && gcc -O1 -fomit-frame-pointer -c memset_opt.c -o memset_opt.o && objdump -d memset_opt.o | grep -A 10 vmem
0000000000000000 <vmemset>:
   0:   c7 07 00 00 00 00       movl   $0x0,(%rdi)
   6:   c3                      retq
0000000000000007 <vmemcmp>:
   7:   48 83 ec 18             sub    $0x18,%rsp
   b:   89 7c 24 0c             mov    %edi,0xc(%rsp)
   f:   ba 04 00 00 00          mov    $0x4,%edx
  14:   48 8d 74 24 0c          lea    0xc(%rsp),%rsi
  19:   bf 00 00 00 00          mov    $0x0,%edi
  1e:   e8 00 00 00 00          callq  23
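The source of memset_opt.c is not shown in the thread. A minimal sketch of what a similar test file could look like is given below; apart from the function names vmemset and vmemcmp, which appear in the listing, everything here (the global, the argument types) is a guess, and the code generated from it may differ from the listing.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical global that vmemcmp compares against. */
int reference_word = 0;

/* Fixed-size memset of one int; with optimization enabled, a compiler
   can turn this into a single store. */
void vmemset(int *p)
{
    assert(p != NULL);
    memset(p, 0, sizeof *p);
}

/* Fixed-size memcmp of one int against the global. */
int vmemcmp(int v)
{
    return memcmp(&reference_word, &v, sizeof v);
}
```

Compiling such a file with the command line shown above and inspecting it with objdump is one way to see which calls the compiler expands inline.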
then we might add further MY_ASSUME_ALIGNED() assertions here and there, to allow gcc and clang to generate better code for POWER and ARM.
If the compiler is smart enough, it might suffice to implement an accessor for buf_block_t or buf_block_t::frame that would MY_ASSUME_ALIGNED(frame, 4096). Then the compiler might correctly infer the alignment of (block->frame + some_compile_time_constant) and enable the optimization. I would be unwilling to pepper such hints all over the code.
Marko
Hi Daniel,
You seem to be right that the compilers are already mostly doing the right thing. Here is a notable exception where GCC lags behind clang (unnecessary use of stack): https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89804
I created another test program, checking how mach_read_from_4() gets compiled. It turns out that on Aarch64 and POWER, unaligned reads are being used by default: https://godbolt.org/z/ZcavM4
For 32-bit ARM, -march=armv6 seems to enable unaligned reads. For RISC-V and WebAssembly, the code is rather ugly. :-)
So, indeed, there does not appear to be much to micro-optimize here.
Marko
-- Marko Mäkelä, Lead Developer InnoDB MariaDB Corporation
For what it's worth, unaligned access does come with a performance penalty,
typically somewhere in the 1-10% range on x86, depending on the generation
of chip used. It has been _mostly_ mitigated on recent x86 chips, and IIRC
Intel's C compiler does have an option to align all structs and arrays to a
16 byte boundary.
I would be very interested to see some test data on unaligned access cost
on various aarch64 chips. On various 32-bit ARM chips (including those >=
ARMv6) the unaligned access performance hit was quite dramatic.
On Thu, Nov 28, 2019 at 3:24 PM Gordan Bobic wrote:
For what it's worth, unaligned access does come with a performance penalty, typically somewhere in the 1-10% range on x86, depending on the generation of chip used. It has been _mostly_ mitigated on recent x86 chips, and IIRC Intel's C compiler does have an option to align all structs and arrays to a 16 byte boundary.
Yes, there is overhead, and there are some unfortunate design choices (or problems) with the InnoDB page format. Luckily, most page header and footer fields are reasonably aligned.
I would be very interested to see some test data on unaligned access cost on various aarch64 chips. On various 32-bit ARM chips (including those >= ARMv6) the unaligned access performance hit was quite dramatic.
I wonder if the unaligned access could ever end up costing more than the instruction decoding overhead for implementing multi-byte access via single-byte operations. (In the past, when unaligned access could have been supported by an interrupt to the operating system, like Digital UNIX on the Alpha, I could easily believe it. But, now we are talking about hardware-supported unaligned access.) Marko -- Marko Mäkelä, Lead Developer InnoDB MariaDB Corporation
On Fri, Nov 29, 2019 at 9:22 AM Marko Mäkelä wrote:
On Thu, Nov 28, 2019 at 3:24 PM Gordan Bobic wrote:
I would be very interested to see some test data on unaligned access cost on various aarch64 chips. On various 32-bit ARM chips (including those >= ARMv6) the unaligned access performance hit was quite dramatic.
I wonder if the unaligned access could ever end up costing more than the instruction decoding overhead for implementing multi-byte access via single-byte operations. (In the past, when unaligned access could have been supported by an interrupt to the operating system, like Digital UNIX on the Alpha, I could easily believe it. But, now we are talking about hardware-supported unaligned access.)
Last time I measured it, the difference was somewhere in the region of 20x slower on ARMv5 (between auto-alignment fixup in the kernel enabled and disabled for code that does unaligned access). Obviously, the code that does unaligned access with the auto-fixup disabled would just read/write garbage, with tragic consequences in some cases.
One of the reasons I stopped using ext4, for example, is that when I started working on ARM32, I discovered that fsck.ext4 is guilty of loading fs blocks into char[4096], and being char, this array is byte aligned. Unfortunately, it would then go on to cast this into a struct with a bigger alignment requirement. The rest you can probably imagine.
Most developers are not even aware that this kind of problem exists, because they have only ever written code that runs on platforms with transparent alignment fixup, like x86, so the worst case scenario is that it runs slower rather than resulting in outright data corruption.
IIRC the Intel compiler's 16-byte align option effectively makes every array definition happen as if it were pragma aligned to 16 bytes explicitly, thus avoiding the problem. Of course, that doesn't help on any platform other than x86, and there its main purpose is optimizing auto-vectorization of loops that operate on such arrays.
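The char[4096] pattern described above can be illustrated with a small sketch. The struct below is hypothetical and is not an ext4 structure; the point is only to contrast casting a byte buffer to a struct pointer with copying the bytes into a properly aligned object.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical on-disk header, used only for illustration. */
struct demo_header {
    uint32_t magic;
    uint32_t block_count;
};

/* Risky pattern: casting a byte buffer to a struct pointer assumes the
   buffer meets the struct's alignment, which a char array does not
   guarantee:
       const struct demo_header *h = (const struct demo_header *)buf;
   Safer pattern: copy the bytes into a properly aligned object. */
static inline struct demo_header load_header(const unsigned char *buf)
{
    struct demo_header h;
    assert(buf != NULL);
    memcpy(&h, buf, sizeof h);
    return h;
}
```

The memcpy() version works at any offset within the buffer, and for small fixed sizes compilers commonly expand the copy inline, as discussed earlier in the thread.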
participants (5)
- Daniel Black
- Gordan Bobic
- Marko Mäkelä
- Vlad Bogolin
- Zheng Zhenyu