For what it's worth, unaligned access does come with a performance penalty, typically somewhere in the 1-10% range on x86, depending on the generation of chip used. It has been _mostly_ mitigated on recent x86 chips, and IIRC Intel's C compiler does have an option to align all structs and arrays to a 16 byte boundary.
I would be very interested to see some tests data on unalighed access cost on various aarch64 chips. On various 32-bit ARM chips (including those >= ARMv6) the unaligned access performance hit was quite dramatic.