I have been analysing CPU bottlenecks in single-threaded sysbench read-only load. I found that icache misses are the main bottleneck, and that profile-guided compiler optimisation (PGO) with GCC gives a large speedup, 25% or more.

(More details in my blog posts:

    http://kristiannielsen.livejournal.com/17676.html
    http://kristiannielsen.livejournal.com/18168.html
)

Now I would like to ask for some discussion/help on how to get this implemented in practice. It involves changing the build process for our binaries: first compile with gcc --coverage, then run some profile workload, then recompile with -fprofile-use.

I implemented a simple program to generate some profile load:

    https://github.com/knielsen/gen_profile_load

It runs a bunch of simple insert/select/update/delete statements, with different combinations of storage engine, binlog format, and client API. It is designed to run inside the build tree and to handle starting and stopping the server being tested, so it is pretty close to a working setup.

These commands work to generate a binary that is faster due to PGO:

    mkdir bld
    cd bld
    cmake -DWITHOUT_PERFSCHEMA_STORAGE_ENGINE=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo \
        -DCMAKE_C_FLAGS_RELWITHDEBINFO="-Wno-maybe-uninitialized -g -O3 --coverage" \
        -DCMAKE_CXX_FLAGS_RELWITHDEBINFO="-Wno-maybe-uninitialized -g -O3 --coverage" ..
    make
    tests/gen_profile_load
    cmake -DWITHOUT_PERFSCHEMA_STORAGE_ENGINE=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo \
        -DCMAKE_C_FLAGS_RELWITHDEBINFO="-Wno-maybe-uninitialized -g -O3 -fprofile-use -fprofile-correction" \
        -DCMAKE_CXX_FLAGS_RELWITHDEBINFO="-Wno-maybe-uninitialized -g -O3 -fprofile-use -fprofile-correction" ..
    make

So all the pieces really are there, and it should be possible to implement it. But we need to find a good way to integrate it into our build system.

The best would be to integrate it into our cmake files.
The gen_profile_load.c could go into tests/. Ideally we would build both a statically and a dynamically linked version (so we get PGO for both libmysqlclient.a and libmysqlclient.so). Can anyone help me get cmake to do that?

It would also be cool if we could get the above procedure to work completely within cmake, so that the user could just do:

    cmake -DWITH_PGO ... ; make

and cmake would itself handle first building with --coverage, then running gen_profile_load.static and gen_profile_load.dynamic, then rebuilding with -fprofile-use. Does anyone know if this is possible with cmake, and if so, could you help implement it?

Alternatively, we could integrate a double build, like the commands above, into the buildbot scripts (.deb, .rpm, bintar).

Any comments? Here are some more points:

 - I tested that gen_profile_load gives a good speedup of sysbench read-only (around 30%, so still very significant even though it generates a different and more varied load).

 - As another test, I removed all SELECT from gen_profile_load, and ran the resulting PGO binary with sysbench read-only. This still gave a fair speedup, despite the PGO load being completely different from the benchmark load. This gives me confidence that PGO should not cause performance regressions in cases not covered well by gen_profile_load.

 - More tests would be nice, of course. Axel, would you be able to build some binaries following the above procedure, and test some different random benchmarks? Anything that is easy to run could be interesting, both to test for improvement and to check against regressions.

 - We probably need a recent GCC version to get good results. I used GCC version 4.7.2. Maybe we should install this GCC version in all the VMs we use to build binaries?

 - Should we do this in 5.5? I think we might want to. The speedup is quite significant, and it seems very safe: no code modifications are involved, only different compiler options. Any thoughts?
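
For the buildbot alternative, here is a rough sketch of how the double build could be wrapped in a shell function. The compiler flags and the sequence of steps are the ones from the commands above; the names pgo_flags, run, configure and pgo_build, and the DRY_RUN variable, are made up for illustration and do not exist in any of our scripts. It assumes it is started from the top of the source tree.

```shell
# Sketch only: two-pass PGO build, mirroring the manual commands.
# Set DRY_RUN=1 to print the commands instead of executing them.

BASE_FLAGS="-Wno-maybe-uninitialized -g -O3"

# Print the compiler flags for a phase: "generate" (instrumented
# build) or "use" (rebuild using the collected profile).
pgo_flags() {
    case "$1" in
        generate) echo "$BASE_FLAGS --coverage" ;;
        use)      echo "$BASE_FLAGS -fprofile-use -fprofile-correction" ;;
        *)        echo "unknown phase: $1" >&2; return 1 ;;
    esac
}

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Configure the build tree (current directory) for the given phase.
configure() {
    flags=$(pgo_flags "$1") || return 1
    run cmake -DWITHOUT_PERFSCHEMA_STORAGE_ENGINE=1 \
        -DCMAKE_BUILD_TYPE=RelWithDebInfo \
        "-DCMAKE_C_FLAGS_RELWITHDEBINFO=$flags" \
        "-DCMAKE_CXX_FLAGS_RELWITHDEBINFO=$flags" ..
}

# Full sequence: instrumented build, profile run, optimised rebuild.
pgo_build() {
    run mkdir -p bld &&
    run cd bld &&
    configure generate &&
    run make &&
    run tests/gen_profile_load &&
    configure use &&
    run make
}
```

Calling it as ( DRY_RUN=1; pgo_build ) just lists the seven steps, which is handy for checking the flags; ( pgo_build ) runs them for real, with the subshell keeping the "cd bld" from leaking into the calling script. A package script would then do its usual packaging step on the resulting bld/ tree.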
Volunteers for helping with the cmake or buildbot parts?

 - Kristian.