Sergey Vojtovich <svoj@mariadb.org> writes:
Still a question, mostly to educate myself. According to /proc, the mysqld executable size is something like: VmExe: 12228 kB VmLib: 6272 kB
I assume the above refers to the total amount of instructions. The level 1 instruction cache size is something like 32 KB, right?
Yes, 32 KB.
When you say that we're executing too much code per request, did you mean the above?
No. I was referring to the actual code that is touched by the given load. In my sysbench read-only benchmarks, we run around 40000 instructions per query. But some of those are in loops, so it is unknown how many distinct instructions need to be fetched (maybe cachegrind could help determine this). If the live set, that is, the instructions actually executed in a given load, fit in the L1 instruction cache, then we would see a large gain in performance. That might not be possible to achieve, though. My hypothesis is that the reduction in icache misses from PGO comes from the compiler being able to re-arrange the basic blocks of the code so that the actual benchmark load ends up with fewer and longer straight-line execution paths. This would reduce the number of half-used cache lines in the icache, and also help the hardware prefetcher reduce the impact of icache misses. The actual size of the executable does not matter much, only the parts that are actually executed during a given load.
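To put rough numbers on the live set, here is a back-of-the-envelope sketch in C. It is only an illustration: the function names are made up, and the average instruction length (4 bytes) and cache line size (64 bytes) used in the note below are assumptions, not measurements from the benchmark.

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on the code bytes fetched per query, under the assumption
   that every executed instruction is distinct (i.e. no loops at all). */
size_t footprint_bytes(size_t instructions, size_t avg_insn_bytes)
{
    return instructions * avg_insn_bytes;
}

/* Number of cache lines needed to hold `bytes` of densely packed code,
   rounded up to whole lines. */
size_t lines_needed(size_t bytes, size_t line_bytes)
{
    assert(line_bytes > 0);
    return (bytes + line_bytes - 1) / line_bytes;
}

/* Non-zero if a code footprint of `bytes` fits in a cache of `cache_bytes`. */
int fits_in_cache(size_t bytes, size_t cache_bytes)
{
    return bytes <= cache_bytes;
}
```

With 40000 instructions and the assumed 4-byte average, footprint_bytes() gives 160000 bytes, or 2500 lines of 64 bytes, against the 512 lines that a 32 KB cache holds. Loops make the real number of distinct instructions smaller than this bound, while half-used cache lines make the number of lines touched larger than the dense estimate, so only a measurement can settle it.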
Do you think we can get a similar speedup by adding compiler hints (e.g. likely/unlikely) and code optimizations?
I do not know for sure, but I think it is unlikely. We may be able to get some of the speedup with such hints. But as far as I remember from the GCC documentation, there are a number of optimisations that are only enabled when actually using profile-guided optimisation. But it is hard to say for sure...
 - Kristian.
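For reference, the likely/unlikely hints mentioned above are usually thin wrappers around the GCC/Clang builtin __builtin_expect. The following is a minimal sketch, not code from the server; sum_bytes() is a hypothetical function made up for the example.

```c
#include <stddef.h>

/* Branch hints: use __builtin_expect where the compiler supports it,
   otherwise fall back to the plain condition. */
#if defined(__GNUC__) || defined(__clang__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

/* Hypothetical example: sum the bytes of a buffer, returning -1 if the
   buffer pointer is NULL. */
long sum_bytes(const unsigned char *buf, size_t len)
{
    if (unlikely(buf == NULL))
        return -1;              /* marked as the rarely taken error path */

    long total = 0;
    for (size_t i = 0; i < len; i++)
        total += buf[i];        /* hot path */
    return total;
}
```

The hint only tells the compiler which way a single branch usually goes, so it can keep the common path as straight-line code. With PGO (-fprofile-generate followed by -fprofile-use in GCC) the same kind of information is collected for every branch from an actual run, which is why hand-placed hints are at best a partial substitute.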