Re: [Maria-discuss] Some questions about the Aarch64 CI

29 Nov 2019

On Fri, Nov 29, 2019 at 9:22 AM Marko Mäkelä <marko.makela@mariadb.com>
wrote:
...
On Thu, Nov 28, 2019 at 3:24 PM Gordan Bobic <gordan.bobic@gmail.com>
wrote:
...
I would be very interested to see some test data on unaligned access
cost on various aarch64 chips. On various 32-bit ARM chips (including those
>= ARMv6) the unaligned access performance hit was quite dramatic.
I wonder if the unaligned access could ever end up costing more than
the instruction decoding overhead for implementing multi-byte access
via single-byte operations. (In the past, when unaligned access could
have been supported by an interrupt to the operating system, like
Digital UNIX on the Alpha, I could easily believe it. But, now we are
talking about hardware-supported unaligned access.)
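
To make the two alternatives in the question above concrete, here is a minimal C sketch. The function names are hypothetical and not taken from the MariaDB source; the first assembles a 32-bit little-endian value from single-byte loads, the second uses memcpy, which a compiler is free to turn into one unaligned load where the hardware supports it. Which one is faster on a given aarch64 core would have to be measured.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(uint32_t) == 4, "uint32_t must be 4 bytes");

/* Hypothetical helper: read a 32-bit little-endian value from any
 * address using only single-byte loads, so alignment never matters. */
static uint32_t load_le32_bytes(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: read a 32-bit value in host byte order from any
 * address. memcpy has no alignment requirement, and the compiler may
 * lower it to a single load on targets with hardware unaligned access. */
static uint32_t load_host32_memcpy(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Both helpers are well defined for any pointer value, unlike dereferencing a misaligned uint32_t pointer.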
Last time I measured it on ARMv5, code that does unaligned access ran
somewhere in the region of 20x slower with the kernel's auto-alignment fixup
enabled than with it disabled.
Obviously, the code that does unaligned access with the auto-fixup disabled
would just read/write garbage, with tragic consequences in some cases. One
of the reasons I stopped using ext4, for example, is because when I started
working on ARM32, I discovered that fsck.ext4 is guilty of loading fs
blocks into char[4096], and being char this array is byte aligned.
Unfortunately, it would then go on to cast this into a struct with a bigger
alignment requirement. The rest you can probably imagine.
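
A sketch of the pattern described above, assuming a made-up header struct (this is not the real ext4 layout, and the names are hypothetical). The unsafe cast is shown only in a comment; the function copies the bytes into a properly aligned object instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical on-disk header, for illustration only. */
struct blk_header {
    uint32_t magic;
    uint32_t block_count;
};

static_assert(sizeof(struct blk_header) == 8, "no padding expected");

/* The problematic pattern (do not do this):
 *
 *   const struct blk_header *h = (const struct blk_header *)(buf + off);
 *   return h->magic;
 *
 * buf is only guaranteed byte alignment, so the cast can produce a
 * misaligned pointer; dereferencing it is undefined behaviour and can
 * return garbage on CPUs without alignment fixup. */

/* Safer alternative: copy the bytes into an aligned local object. */
static struct blk_header read_header(const unsigned char *buf, size_t off)
{
    struct blk_header h;
    memcpy(&h, buf + off, sizeof h);
    return h;
}
```

The copy costs a few bytes of stack, but it behaves the same on every platform regardless of where the block buffer happens to land.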
Most developers are not even aware that this kind of a problem exists
because they only ever wrote code that runs on platforms that have
transparent alignment fixup like x86, so the worst case scenario is that it
runs slower rather than resulting in outright data corruption.

IIRC Intel compiler's 16-byte align option effectively makes every array
definition happen as if it were pragma aligned to 16 bytes explicitly, thus
avoiding the problem. Of course, that doesn't help on any platform other
than x86, and there its main purpose is for optimizing auto-vectorization
of loops that operate on such arrays.
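
For comparison, the portable way to ask for that alignment explicitly is C11 alignas. The sketch below is an illustration with hypothetical names, not a reproduction of what the Intel option does internally. Note that it only aligns the start of the array; a struct read at an arbitrary offset inside it can still be misaligned.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical block buffer, explicitly aligned to 16 bytes. */
static alignas(16) unsigned char block_buf[4096];

/* Return non-zero if p is a multiple of n bytes. */
static int is_aligned(const void *p, uintptr_t n)
{
    return ((uintptr_t)p % n) == 0;
}

/* Hand out the buffer, checking the alignment guarantee in debug builds. */
static unsigned char *get_block_buf(void)
{
    assert(is_aligned(block_buf, 16));
    return block_buf;
}
```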