Hello Kevin,

It was my pleasure to meet you in Shanghai.

On my flight back, I worked on a micro-optimization, trying to make sure that native loads or stores are used instead of memcpy(), memset(), or memcmp() when the data is known to be aligned. I filed a ticket for it:

https://jira.mariadb.org/browse/MDEV-21133
Optimize access to InnoDB page header fields

My colleague Eugene Kosov pointed out that such loads or stores are undefined behaviour (and cmake -DWITH_UBSAN=ON would likely agree). However, he showed that wrapping the arguments of <string.h> functions with __builtin_assume_aligned() actually works:

https://godbolt.org/z/jCF_6q

Eugene also pointed me to some related work:

http://open-std.org/JTC1/SC22/WG21/docs/papers/2019/p1774r1.pdf

I found a claim that AArch64 does support unaligned access in practice:

https://stackoverflow.com/questions/38535738/does-aarch64-support-unaligned-...

Can you provide a more authoritative answer? Is there some flag that should be passed to gcc or clang to enable it to generate simpler code?

I also found a claim that POWER8 supports unaligned access, and I seem to remember that the latest version of SPARC introduced support for that as well. (IA-32 and AMD64 have always supported unaligned access, except for some SIMD operations.)

Last, I believe that we could get some performance benefits if include/byte_order_generic.h was rewritten in a suitable way. Ideally, it and include/byte_order_generic_x86_64.h would be replaced with a single portable version, and compilers could simply perform the optimizations. I have been told that replacing the + in the macros with | could already be a good start. I would welcome patches in this area.

Related note: Maybe a year ago, I was positively surprised to learn that the InnoDB monster function mach_read_from_4() is being translated into a single 80486 BSWAP instruction, or an AMD64 MOVBE instruction.

With best regards,

Marko
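
For illustration only, here is a minimal sketch of the two ideas mentioned above: combining bytes with | instead of +, and wrapping a memcpy() argument in __builtin_assume_aligned(). The helper names read_be32(), load_aligned_u32() and self_check() are hypothetical and are not taken from the MariaDB sources; the code assumes GCC or clang, since __builtin_assume_aligned() is a compiler extension. Whether the compiler actually reduces these to a single load (plus a byte swap) depends on the target and the optimization level.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 32-bit big-endian value from 4 bytes.
   The bytes are combined with | rather than +, a pattern that
   optimizing compilers typically recognize as a byte-swapped load. */
static inline uint32_t read_be32(const unsigned char *p)
{
  return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
         ((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/* Hypothetical helper: copy 4 bytes in native byte order from a pointer
   that the caller promises is 4-byte aligned. memcpy() keeps the access
   well defined; the builtin only passes the alignment promise on to the
   compiler. Breaking the promise is undefined behaviour. */
static inline uint32_t load_aligned_u32(const void *p)
{
  uint32_t v;
  memcpy(&v, __builtin_assume_aligned(p, 4), sizeof v);
  return v;
}

/* Small self-check; returns 1 when both helpers behave as expected. */
static int self_check(void)
{
  static const unsigned char bytes[8] =
    {0x12, 0x34, 0x56, 0x78, 0xde, 0xad, 0xbe, 0xef};
  uint32_t storage[2]; /* a uint32_t array is suitably aligned */
  const unsigned char *page = (const unsigned char *) storage;
  uint32_t native;

  memcpy(storage, bytes, sizeof bytes);
  native = load_aligned_u32(page + 4);

  return read_be32(page) == 0x12345678U &&
         memcmp(&native, page + 4, sizeof native) == 0;
}
```

Comparing the generated assembly with and without the builtin (for example with -O2 -S) on each target of interest would show whether the alignment hint changes anything there.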