[Maria-developers] Next steps in improving single-threaded performance
I have been analysing CPU bottlenecks in single-threaded sysbench read-only load. I found that icache misses is the main bottleneck, and that profile-guided compiler optimisation (PGO) with GCC gives a large speedup, 25% or more.

(More details in my blog posts:

http://kristiannielsen.livejournal.com/17676.html
http://kristiannielsen.livejournal.com/18168.html )

Now I would like to ask for some discussion/help on how to get this implemented in practice. It involves changing the build process for our binaries: first compile with gcc --coverage, then run some profile workload, then recompile with -fprofile-use.

I implemented a simple program to generate some profile load:

https://github.com/knielsen/gen_profile_load

It runs a bunch of simple insert/select/update/delete, with different combinations of storage engine, binlog format, and client API. It is designed to run inside the build tree and handle starting and stopping the server being tested, so it is pretty close to a working setup. These commands work to generate a binary that is faster due to PGO:

  mkdir bld
  cd bld
  cmake -DWITHOUT_PERFSCHEMA_STORAGE_ENGINE=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_C_FLAGS_RELWITHDEBINFO="-Wno-maybe-uninitialized -g -O3 --coverage" -DCMAKE_CXX_FLAGS_RELWITHDEBINFO="-Wno-maybe-uninitialized -g -O3 --coverage" ..
  make
  tests/gen_profile_load
  cmake -DWITHOUT_PERFSCHEMA_STORAGE_ENGINE=1 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_C_FLAGS_RELWITHDEBINFO="-Wno-maybe-uninitialized -g -O3 -fprofile-use -fprofile-correction" -DCMAKE_CXX_FLAGS_RELWITHDEBINFO="-Wno-maybe-uninitialized -g -O3 -fprofile-use -fprofile-correction" ..
  make

So all the pieces really are there; it should be possible to implement this. But we need to find a good way to integrate it into our build system.

The best would be to integrate it into our cmake files. The gen_profile_load.c could go into tests/; ideally we would build both a statically and a dynamically linked version (so we get PGO for both libmysqlclient.a and libmysqlclient.so). Can anyone help me get cmake to do that?

And it would be cool if we could get the above procedure to work completely within cmake, so that the user could just do:

  cmake -DWITH_PGO ... ; make

and cmake would itself handle first building with --coverage, then running gen_profile_load.static and gen_profile_load.dynamic, then rebuilding with -fprofile-use. Does anyone know if this is possible with cmake, and if so, could you help implement it?

But alternatively, we could integrate a double build, like the commands above, into the buildbot scripts (.deb, .rpm, bintar).

Any comments? Here are some more points:

- I tested that gen_profile_load gives a good speedup of sysbench read-only (around 30%, so still very significant even though it generates a different and more varied load).

- As another test, I removed all SELECT from gen_profile_load, and ran the resulting PGO binary with sysbench read-only. This still gave a fair speedup, despite the PGO load being completely different from the benchmark load. This gives me confidence that the PGO should not cause performance regressions in cases not covered well by gen_profile_load.

- More tests would be nice, of course. Axel, would you be able to build some binaries following above procedure, and test some different random benchmarks? Anything that is easy to run could be interesting, both to test for improvement, and to check against regressions.

- We probably need a recent GCC version to get good results. I used GCC version 4.7.2. Maybe we should install this GCC version in all the VMs we use to build binaries?

- Should we do this in 5.5? I think we might want to. The speedup is quite significant, and it seems very safe - no code modifications are involved, only different compiler options.

Any thoughts? Volunteers for helping with the cmake or buildbot parts?

 - Kristian.
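For illustration, a stripped-down sketch of the kind of client loop gen_profile_load runs might look like the following. This is a hypothetical sketch against the MySQL C API, not the actual program (which also starts and stops the server and varies storage engine, binlog format and client API), and the connection parameters are made up:

/* Hypothetical, minimal profile-load client in the spirit of gen_profile_load:
 * run a mix of INSERT/SELECT/UPDATE/DELETE through the C API so that a
 * --coverage build of the server records a reasonably varied profile.
 * Connection parameters below are placeholders, not the real tool's values. */
#include <stdio.h>
#include <stdlib.h>
#include <mysql.h>

static MYSQL *con;

static void die(void)
{
  fprintf(stderr, "Error: %s\n", mysql_error(con));
  exit(1);
}

static void run(const char *q)
{
  MYSQL_RES *res;
  if (mysql_query(con, q))
    die();
  if ((res= mysql_store_result(con)))   /* drain any result set */
    mysql_free_result(res);
}

int main(void)
{
  char q[256];
  int i;

  con= mysql_init(NULL);
  if (!mysql_real_connect(con, "localhost", "root", "", "test", 0, NULL, 0))
    die();

  run("CREATE TABLE IF NOT EXISTS t1 (a INT PRIMARY KEY, b VARCHAR(64)) ENGINE=InnoDB");
  for (i= 0; i < 10000; i++)
  {
    snprintf(q, sizeof(q), "REPLACE INTO t1 VALUES (%d, 'row %d')", i % 1000, i);
    run(q);
    snprintf(q, sizeof(q), "SELECT b FROM t1 WHERE a=%d", i % 1000);
    run(q);
    snprintf(q, sizeof(q), "UPDATE t1 SET b='updated %d' WHERE a=%d", i, i % 1000);
    run(q);
    if (i % 10 == 0)
    {
      snprintf(q, sizeof(q), "DELETE FROM t1 WHERE a=%d", i % 1000);
      run(q);
    }
  }
  mysql_close(con);
  return 0;
}

Built once against libmysqlclient.a and once against libmysqlclient.so, as suggested above, such a loop exercises both client libraries during the --coverage run.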
Hi Kristian,

Just out of curiosity: is it possible to find out which functions cause the highest amount of icache misses? Can it have anything to do with branch misprediction?

Regards,
Sergey
Sergey Vojtovich <svoj@mariadb.org> writes:
just out of curiosity: is it possible to find out which functions cause the highest amount of icache misses?
Yes, see the second post, the profiles marked "Icache misses (ICACHE.MISSES), before PGO" and "Icache misses (ICACHE.MISSES), after PGO". These are level 1 cache misses.

You will see that the functions with high cache miss rate are more or less the same as the functions that execute a lot of instructions. Note however that according to Intel documentation, there is a large skid on those events, so one should not rely too much on the precise location reported.
Can it have anything to do with branch misprediction?
If you look at the same post, you will see profiles for BR_MISP_RETIRED.ALL_BRANCHES_PS. This is a precise event, so it points directly to the instruction after the mispredicted branch. We do get 12% or so fewer mispredictions, so it has some effect. In comparison, we get 23% fewer icache misses.

Note that the main source of branch misprediction is frequently called shared library functions (due to the indirect jump in PLT), and virtual function calls. This suggests that the problem here is that the sheer number of branches executed causes eviction of otherwise correctly predicted branches. We are simply executing too much code per request for the CPU to handle efficiently, a common thing in server applications.

Another improvement that I noticed is in make_join_statistics(). PGO uses calls to optimised memset() and memcpy() functions for large structure memory writes, instead of byte-by-byte "rep movsb" sequences.

There are probably many small improvements that contribute to the overall speedup, spread out over the code; it is hard to determine precisely with such a large code base. The reason I mention icache misses in particular is that:

1. The performance counter measurements pre-PGO clearly show that icache miss rate is the main bottleneck in the CPU.

2. PGO is well suited to reducing icache misses.

3. Indeed, measurements post-PGO show a significant reduction in icache misses.

 - Kristian.
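For illustration (not from the mail, and the function names are invented): one transformation GCC only performs with profile data is indirect-call promotion, which turns a hot virtual or function-pointer call into a guarded direct call when one target dominates, roughly equivalent to hand-writing:

/* Conceptual sketch of indirect-call promotion; handler_read_innodb is a
 * made-up name standing in for whatever target dominates in the profile. */
typedef int (*read_row_fn)(void *cursor);

int handler_read_innodb(void *cursor) { (void)cursor; return 0; }

/* Before: every call goes through an unpredictable indirect branch. */
int read_row(read_row_fn fn, void *cursor)
{
  return fn(cursor);
}

/* After (what PGO can generate automatically): the dominant target is called
 * directly, which the branch predictor handles well and which allows inlining;
 * the rare case falls back to the original indirect call, so behaviour is
 * unchanged. */
int read_row_promoted(read_row_fn fn, void *cursor)
{
  if (fn == handler_read_innodb)
    return handler_read_innodb(cursor);
  return fn(cursor);
}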
Hi Kristian,

Yes, the second post answers most of my questions. Somehow I missed it, sorry.

Still a question, mostly to educate myself. According to /proc, the mysqld executable size is something like:

VmExe: 12228 kB
VmLib:  6272 kB

I assume the above refers to overall instructions. Level 1 instruction cache size is like 32Kb, right? When you say that we're executing too much code per request, did you mean the above?

Do you think we can get similar speedup by putting compiler hints (e.g. likely/unlikely) and code optimizations?

Thanks,
Sergey
Sergey Vojtovich <svoj@mariadb.org> writes:
Still a question, mostly to educate myself. According to /proc, the mysqld executable size is something like: VmExe: 12228 kB VmLib: 6272 kB
I assume the above refers to overall instructions. Level 1 instruction cache size is like 32Kb, right?
Yes, 32Kb.
When you say that we're executing too much code per request, did you mean the above?
No. I was referring to the actual code that is touched by the given load. In my sysbench read-only benchmarks, we run around 40000 instructions per query. But some of those are in loops, so it is unknown how many distinct instructions need to be fetched (maybe cachegrind could help determine this).

If the live set, that is, the actual instructions executed in a given load, would fit in the L1 instruction cache, then we would see a large gain in performance. That might not be possible to achieve, though.

My hypothesis is that the reduction in icache misses from PGO comes from the compiler being able to re-arrange the basic blocks of the code so that the actual benchmark load ends up with fewer and larger straight-line code execution paths. This would help reduce the number of half-used cache lines in the icache, and also help the hardware prefetcher reduce the impact of icache misses. The actual size of the executable does not matter much, only the parts that are actually executed during a given load.
Do you think we can get similar speedup by putting compiler hints (e.g. likely/unlikely) and code optimizations?
I do not know for sure, but I think it is unlikely. We may be able to get some of the speedup with such hints. But as I remember the GCC documentation, there are a number of optimisations that are only enabled if actually using profile-guided optimisation. But it is hard to say for sure... - Kristian.
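For reference, a minimal sketch of the kind of hand-written hints being discussed, using GCC's __builtin_expect() and a cold attribute (the macro and function definitions here are illustrative, not taken from the server sources):

/* Hand-written branch hints: __builtin_expect() tells GCC which way a branch
 * usually goes, and __attribute__((cold)) lets it move an error path out of
 * the hot code layout.  PGO derives the same kind of information (with
 * measured probabilities) automatically from the training run. */
#include <stddef.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

__attribute__((cold, noinline)) static int report_bad_packet(void)
{
  /* cold path: kept out of the straight-line hot layout */
  return -1;
}

int handle_packet(const unsigned char *buf, size_t len)
{
  if (unlikely(buf == NULL || len == 0))
    return report_bad_packet();

  /* hot path: laid out as straight-line code */
  return (int)buf[0];
}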
Hi Kristian,

Kristian Nielsen wrote:
I have been analysing CPU bottlenecks in single-threaded sysbench read-only load. I found that icache misses is the main bottleneck, and that profile-guided compiler optimisation (PGO) with GCC gives a large speedup, 25% or more.
(More details in my blog posts:
http://kristiannielsen.livejournal.com/17676.html http://kristiannielsen.livejournal.com/18168.html )
Wow. 25% is a lot. Have you also tried compiling MySQL 5.6 with PGO? Because if that gets the same improvement, we haven't won anything in the comparison.

I played a bit with PGO back at SAP, when we worked with the Intel guys and used the Intel compiler. One of the bottlenecks we found there was the SQL parser. It was just too big to fit into the L1 cache, and also too big to be optimized at once.
Any comments? Here are some more points:
- I tested that gen_profile_load gives a good speedup of sysbench read-only (around 30%, so still very significant even though it generates a different and more varied load).
This is interesting. By definition, PGO should work best if the workload used for profiling matches the production workload. I hadn't expected that a partial match gives such good results too. This is something that needs more tests.
- More tests would be nice, of course. Axel, would you be able to build some binaries following above procedure, and test some different random benchmarks? Anything that is easy to run could be interesting, both to test for improvement, and to check against regressions.
Yes, I'll certainly do that.

Speaking of regressions - if we plan to deliver binaries built with PGO, we must also test the influence of different architectures, i.e. how a binary built on Intel behaves when run on AMD.
- We probably need a recent GCC version to get good results. I used GCC version 4.7.2. Maybe we should install this GCC version in all the VMs we use to build binaries?
That is the gcc version installed @ lizard2. The facebook machines still have a 4.5.x (SuSE specific snapshot from 2010 ... WTF?).

Jani: could you look into upgrading gcc on the facebook machines?

XL
Axel Schwenke <axel@askmonty.org> writes:
Wow. 25% is a lot. Have you also tried compiling MySQL 5.6 with PGO?
No.
Because if that gets the same improvement, we haven't won anything in the comparison.
On the contrary, if the same works for MySQL 5.6 (and it seems likely it will), then we have won double - both users on MariaDB _and_ users on MySQL 5.6 will benefit from increased performance. (I know what you mean, of course, but seriously - the goal is to improve performance for as many users as possible, not to "win" in some marketing stunt. For me, it is.)
This is interesting. By definition, PGO should work best if the workload used for profiling matches the production workload. I hadn't expected that a partial match gives such good results too.
This is something that needs more tests.
Agreed that more tests would be good. Even if the workload is not identical, there should be many common code paths, which would explain why there is still some improvement. If PGO improved only one particular workload and slowed down the rest, it would be unattractive. My tests so far seem to show that this is not the case, but more testing would be good.
Yes, I'll certainly do that.
Cool. Let me know if you need any help - hopefully the procedure I wrote in the earlier mail will work for you. Note that currently, the gen_profile_load program needs the build directory to be a directory inside the source directory, IIRC (e.g. mariadb-10.0/build/).
Speaking of regressions - if we plan to deliver binaries built with PGO, we must also test the influence of different architectures, i.e. how a binary built on Intel behaves when run on AMD.
Hm, yes it would be good to test, but do you expect any issues here? I did not use any cpu-specific options, and the profiling output should be independent of the underlying cpu, I think? - Kristian.
Kristian Nielsen wrote:
I implemented a simple program to generate some profile load:
I propose the following change to make it work with MariaDB releases before 10.0.4 (and MySQL) that lack the SHUTDOWN statement:

--- gen_profile_load.c.orig	2014-02-11 14:01:34.896583280 +0100
+++ gen_profile_load.c	2014-02-12 15:44:24.107310585 +0100
@@ -560,7 +560,7 @@
 {
   int status;
 
-  do_queryf("shutdown");
+  mysql_shutdown(&mysql, SHUTDOWN_DEFAULT);
 
   if (mysqld_pid <= 0)
     return;

XL
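If keeping the SHUTDOWN statement on servers that support it is preferred over an unconditional switch, a possible variant (a sketch only, not part of the proposed patch) would be to decide at runtime from the server version reported on the connection; in gen_profile_load this would be called with &mysql:

/* Hypothetical variant: use the SQL SHUTDOWN statement where the server
 * supports it (MariaDB 10.0.4 and later report a version >= 100004 through
 * mysql_get_server_version()), and fall back to the older C API call, as in
 * the patch above, for MySQL and older MariaDB servers. */
#include <mysql.h>

static void shutdown_server(MYSQL *con)
{
  if (mysql_get_server_version(con) >= 100004)
    mysql_query(con, "SHUTDOWN");
  else
    mysql_shutdown(con, SHUTDOWN_DEFAULT);
}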
Hi,

Kristian Nielsen wrote:
I have been analysing CPU bottlenecks in single-threaded sysbench read-only load. I found that icache misses is the main bottleneck, and that profile-guided compiler optimisation (PGO) with GCC gives a large speedup, 25% or more.
Here are some more results.

Benchmark 1 is good old sysbench OLTP. I tested 10.0.7 vs. 10.0.7-pgo. With low concurrency there is about 10% win by PGO; however this is completely reversed at higher concurrency by mutex contention (the test was with performance schema disabled, so cannot say which mutex, probably LOCK_open).

Normally I run with preloaded tcmalloc. However, since 10.0.5(?) MariaDB uses jemalloc internally. Since this is built with MariaDB, it could benefit from PGO. However, the numbers look quite similar for tcmalloc vs. jemalloc.

The other benchmark is purely single threaded and runs Q1 from DBT3 for memory based data. Here I include data for many MariaDB and MySQL versions for comparison. The plot is a classical box-and-whiskers plot where the box contains 50% of the data points (25-75 percentile) and the whiskers mark minimum and maximum.

This time the win is about 5% for MariaDB-10.0.8 and ~0 for MariaDB-5.5.35. However, those results should be taken with a grain of salt, as those builds were done with the older gcc-4.6.3. I'll have to re-run with gcc-4.7.2 builds (but on different hardware).

BR, XL
Axel Schwenke <axel@askmonty.org> writes:
Benchmark 1 is good old sysbench OLTP. I tested 10.0.7 vs. 10.0.7-pgo. With low concurrency there is about 10% win by PGO; however this is completely reversed at higher concurrency by mutex contention (the test was with performance schema disabled, so cannot say which mutex, probably LOCK_open).
Ouch, pgo drops the throughput to 1/2! That's a pretty serious blow to the whole idea, unless there is not just a fix but also a good explanation. I will investigate this, thanks a lot for testing!

I must say it is totally unexpected. I would have expected the effect of pgo (whether positive or negative) to be most pronounced at low concurrency, since at high concurrency lock contention dominates, which mainly happens in kernel and library code. And it is strange that a performance improvement at low concurrency manifests itself as a loss at high concurrency.

Maybe pgo re-arranges data to optimise cache sharing? And this introduces more false sharing? But this needs to be checked properly.
The other benchmark is purely single threaded and runs Q1 from DBT3 for memory based data. Here I include data for many MariaDB and MySQL versions for comparison. The plot is a classical box-and-whiskers plot where the box contains 50% of the data points (25-75 percentile) and the whiskers mark minimum and maximum.
If I understand correctly, the noise in those tests is really too big to tell much one way or the other, right? Well, low-concurrency tests also show the effect of single-threaded performance just fine. But clearly, unless there is some explanation for the hit at high concurrency, the PGO idea is not looking attractive...

Again, thanks a lot for looking into this, I will try to find time soon to investigate more.

 - Kristian.
Kristian Nielsen <knielsen@knielsen-hq.org> writes:
Axel Schwenke <axel@askmonty.org> writes:
Benchmark 1 is good old sysbench OLTP. I tested 10.0.7 vs. 10.0.7-pgo. With low concurrency there is about 10% win by PGO; however this is completely reversed at higher concurrency by mutex contention (the test was with performance schema disabled, so cannot say which mutex, probably LOCK_open).
Ouch, pgo drops the throughput to 1/2!
That's a pretty serious blow to the whole idea, unless there is not just a fix but also a good explanation. I will investigate this, thanks a lot for testing!
Ok, so I finally got the time to investigate this. I think I understand what is going on.

So the original problem was that PGO (profile-guided optimisation) showed a fair improvement at lower concurrency, but a significant reduction in throughput at higher concurrency, in sysbench OLTP.

It turns out that the real problem is unrelated to PGO. At higher concurrency, the server code basically falls over, so that adding more concurrent work significantly decreases the throughput. This is a well-known phenomenon.

As a side effect, if we improve the code performance of a single thread, we effectively increase the concurrency in the critical spots - threads spend less time executing the real code, hence more time in concurrency bottlenecks. The end result is that _any_ change that improves single-threaded performance causes throughput to decrease at concurrency levels where the code falls over.

To verify this, I repeated XL's sysbench runs on a number of different mysqld servers. Apart from XL's original 10.0 and 10.0-pgo, I added a run with _no_ optimisations (-O0), and some runs where I used PGO but deliberately decreased performance by putting a dummy loop into the query execution code.

Here are the results for sysbench read-write.

Transactions per second in sysbench OLTP read-write (higher is better):

                        16-rw    128-rw    256-rw    512-rw
  10.0-nopgo          6680.84  13004.87   7850.10   4031.06
  10.0-pgo            7249.39  12199.32   6336.47   2614.58
  10.0-pgo-pause1000  7040.25  12081.80   5825.99   2464.58
  10.0-pgo-pause2000  6774.10  12024.44   5810.60   2433.14
  10.0-pgo-pause4000  6469.06  12859.23   6479.85   2589.90
  10.0-pgo-pause8000  5779.67  13233.35   7074.85   2741.01
  10.0-pgo-pause16000 4710.97  12286.62   7896.23   2889.25
  10.0-noopt          4004.37   9613.89   7920.67   3268.46

As we see, there is a strong correlation between higher throughput at low concurrency and lower throughput at high concurrency. As we add more dummy overhead to the PGO server, throughput at high concurrency increases, and compiling with -O0 is even faster.

The sysbench read-only results are similar, though less pronounced, as the code now does not fall over so badly (I used a recent 10.0 bzr; maybe Svoj's work on LOCK_open has helped solve the problem, or maybe my compiling performance schema out made a difference).

Transactions per second in sysbench OLTP read-only (higher is better):

                        16-ro    128-ro    256-ro    512-ro
  10.0-nopgo          8903.62  19034.44  18369.42  15933.65
  10.0-pgo            9602.81  20057.09  19084.66  13128.61
  10.0-pgo-pause1000  9169.94  20403.00  18814.08  12708.24
  10.0-pgo-pause2000  8870.11  20307.68  18618.01  13015.76
  10.0-pgo-pause4000  8331.52  19903.76  18425.81  13459.38
  10.0-pgo-pause8000  7610.22  18897.86  17650.32  13544.74
  10.0-pgo-pause16000 6079.60  16654.55  15853.86  14008.75
  10.0-noopt          4969.67  12830.43  12263.99  11438.69

Again, at the concurrency levels where PGO is slower than non-PGO, we can improve throughput by inserting dummy pause-loop code.

So the conclusion here is that PGO is actually a viable optimisation (and we should do it for the binaries we release, and if possible integrate it into the cmake builds so that from-source builds will also benefit from it). The high-concurrency sysbench results are meaningless in terms of single-threaded improvements, as any such improvement ends up degrading TPS, and the real problem needs to be fixed elsewhere, by removing lock contention and so on.

I will try next to investigate why the code falls over at high concurrency and see if anything can be done...
It also appears that sysbench results at high concurrency are mostly meaningless for comparison between different code versions, unless we can see that one version falls over and the other does not. Hopefully we can find a way to eliminate the catastrophic performance hit at high concurrency... - Kristian.
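The "dummy loop in the query execution code" used for the pause-NNNN servers above is not shown in the mail; as an illustration, a delay of that kind could look roughly like this (the function name and its placement in the query execution path are assumptions):

/* Hypothetical delay loop of the kind described above: execute N x86 PAUSE
 * instructions somewhere in the per-query code path, purely to make each
 * query artificially more expensive (N = 1000, 2000, ... in the runs above). */
static inline void dummy_pause_loop(unsigned long count)
{
  unsigned long i;
  for (i= 0; i < count; i++)
    __asm__ __volatile__("pause");
}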
Hi Kristian,

On 04/22/2014 04:11 PM, Kristian Nielsen wrote:
As a side effect, if we improve the code performance of a single thread, we effectively increase the concurrency in the critical spots - threads spend less time executing the real code, hence more time in concurrency bottlenecks. The end result is that _any_ change that improves single-threaded performance causes throughput to decrease at concurrency levels where the code falls over.
I have a more "education" focused question, rather than commenting on the issue or the results/arguments. What do you mean with the phrase [code|server] "falls over"? Also a quote from https://lists.launchpad.net/maria-developers/msg07210.html :
Basically, the server "falls over" and starts trashing instead of doing real work, due to some kind of inter-processor communication overhead.
I understand the concept of a program spending increased time in a) context switching or scheduling, and b) communication between threads, at increased concurrency levels. I am simply wondering if you mean anything more specific.

Forgive my ignorance, and the thread hijacking :)

Regards,
Vangelis
Vangelis Katsikaros <vkatsikaros@yahoo.gr> writes:
What do you mean by the phrase [code|server] "falls over"?
I am referring to a common phenomenon in high-concurrency benchmarks. See for example the graph in this email:

https://lists.launchpad.net/maria-developers/msg06799.html
https://lists.launchpad.net/maria-developers/gifSCfeVH5OFW.gif

As the concurrency (number of client threads) increases, throughput also increases; this is what we call "scalability". At some point, we have sufficient concurrency to fully utilise all machine resources, and throughput no longer increases with more concurrency; this is expected.

But if you look at the bars marked "10.0.7-pgo" in that graph, you see that the throughput actually dramatically _decreases_ with increasing concurrency. Such behaviour is rather undesirable. Imagine a real system that gets temporarily overloaded. New requests start arriving faster than they can be satisfied, effectively increasing concurrency. If increasing concurrency causes decreasing throughput, this can get into a negative feedback loop, eventually making the system almost unable to satisfy any requests.

This behaviour, where the throughput does not remain mostly flat as concurrency increases but instead dramatically decreases, is what I somewhat sloppily refer to as "the server falling over".

Hope this helps,

 - Kristian.
Hi Kristian,
Thanks for the detailed explanation! I hadn't noticed the _drastic_ decrease in throughput and I thought you were referring to something else.

Regards,
Vangelis
participants (4):

- Axel Schwenke
- Kristian Nielsen
- Sergey Vojtovich
- Vangelis Katsikaros