On Thu, Nov 28, 2019 at 3:24 PM Gordan Bobic <gordan.bobic@gmail.com> wrote:
> I would be very interested to see some test data on unaligned access cost on various aarch64 chips. On various 32-bit ARM chips (including those >= ARMv6) the unaligned access performance hit was quite dramatic.
I wonder if the unaligned access could ever end up costing more than
the instruction decoding overhead for implementing multi-byte access
via single-byte operations. (In the past, when unaligned access could
have been supported by an interrupt to the operating system, like
Digital UNIX on the Alpha, I could easily believe it. But, now we are
talking about hardware-supported unaligned access.)
Last time I measured it, the difference was somewhere in the region of 20x on ARMv5 (comparing code that does unaligned access with the kernel's auto-alignment fixup enabled versus disabled).
Obviously, code that does unaligned access with the auto-fixup disabled would just read/write garbage, with tragic consequences in some cases. One of the reasons I stopped using ext4, for example, is that when I started working on ARM32, I discovered that fsck.ext4 is guilty of loading fs blocks into a char[4096], and being char, that array is only byte-aligned. Unfortunately, it would then go on to cast this buffer to a struct with a stricter alignment requirement. The rest you can probably imagine.
Most developers are not even aware that this kind of problem exists, because they have only ever written code for platforms with transparent alignment fixup like x86, where the worst-case outcome is that the code runs slower rather than producing outright data corruption.
IIRC the Intel compiler's 16-byte align option effectively makes every array definition behave as if it were explicitly pragma-aligned to 16 bytes, thus avoiding the problem. Of course, that doesn't help on any platform other than x86, and even there its main purpose is optimizing auto-vectorization of loops that operate on such arrays.