One of the dangers of microbenchmarks is that they do not consider cache misses well. Unrolling one loop is no harm, but what happens when you have hundreds of unrolled loops in your program? I would imagine that the probability of cache misses increase greatly, perhaps so much that it regresses performance. Is it possible to control loop unrolling with a pragma, so that only the most performance sensitive loops get unrolled? Thanks, Roy Jay Pipes wrote:
Have we tried running the benchmarks with and without -funroll-loops?
-j
Stewart Smith wrote:
So i broke out mtaylor's patch that we were bumming around with at the UC to replace ptr_compare with a simple memcmp call.
At the UC I benched that this patch actually caused a measurable performance regresssion.
So what's the difference? (same benchmark and machine as in previous mail)
MAX_FIELDS=64 with std::bitset read/write requests: 70000 (3374.50 per sec.) read/write requests: 70000 (3102.42 per sec.) read/write requests: 70000 (3113.47 per sec.) read/write requests: 70000 (3401.89 per sec.) read/write requests: 70000 (3164.61 per sec.) AVERAGE= 3231.37
With ptr_compare replaced with memcmp: read/write requests: 70000 (3090.33 per sec.) read/write requests: 70000 (3066.97 per sec.) read/write requests: 70000 (2954.93 per sec.) read/write requests: 70000 (2875.57 per sec.) read/write requests: 70000 (2953.99 per sec.) AVERAGE= 2988.35
With ptr_compare replaced with __builtin_memcmp: read/write requests: 70000 (3245.37 per sec.) read/write requests: 70000 (3001.91 per sec.) read/write requests: 70000 (3254.99 per sec.) read/write requests: 70000 (3177.13 per sec.) read/write requests: 70000 (3195.49 per sec.) AVERAGE= 3174.97
So, I thought that with just using the __builtin_memcmp automatically, we may be okay with merging this.
However... I do have the following concerns: 1) I'm pretty sure that __builtin_memcmp is gcc only, and that SunStudio may have an issue (yay - wrappers!) 2) Is this perf difference going to be true on different platforms? 3) the various memcmp implementations do seem to be dependent on a few things for performance: alignment, size of data.
For what we're seeing in sysbench, the parameters are things like this: "SELECT c from sbtest where id between 9942 and 10041 order by c"
0xc30780,0xc2c340,240 0xc2e560,0xc2c340,240 0xc30780,0xc2e560,240 0xc37390,0xc32f50,240 0xc35170,0xc32f50,240 0xc37390,0xc35170,240 0xc3dcc8,0xc39888,240 0xc3baa8,0xc2e560,240 0xc35170,0xc2e560,240 0xc3baa8,0xc35170,240 0xc2c340,0xc35170,240 0xc35170,0xc3dcc8,240 0xc2c618,0xc35170,240 0xc35170,0xc3d9f0,240 0xc2c8f0,0xc35170,240 0x7fbed85f10f0,0x7fbed85eccb0,240
and from the CREATE TABLE: `c` VARCHAR(120) NOT NULL COLLATE utf8_general_ci DEFAULT `` ,
So here it's large, and aligned.
For some other micro-benchmarks I've clocked things looking much different.... so it's possibly query dependent as to what ends up being better (i.e. how much data we're comparing).
(1st parameter is number of repetitions, 2nd is number of bytes to memcmp)
For comparing equal values:
stewart@willster:~/src/test/memcmp$ ./a.out 168435455 8 168435455 repetitions
Testing memcmp ..... done, 7.432 seconds Testing builtin memcmp ....... done, 16.313 seconds Testing loop .... done, 6.144 seconds Testing loop32 .... done, 2.764 seconds Testing loop64 .... done, 2.128 seconds Testing no-op .... done, 1.488 seconds stewart@willster:~/src/test/memcmp$ ./a.out 168435455 16 168435455 repetitions
Testing memcmp ..... done, 5.912 seconds Testing builtin memcmp ....... done, 26.418 seconds Testing loop .... done, 11.201 seconds Testing loop32 .... done, 3.980 seconds Testing loop64 .... done, 2.564 seconds Testing no-op .... done, 1.492 seconds stewart@willster:~/src/test/memcmp$ ./a.out 168435455 32 168435455 repetitions
Testing memcmp ..... done, 6.828 seconds Testing builtin memcmp ....... done, 46.891 seconds Testing loop .... done, 21.473 seconds Testing loop32 .... done, 6.536 seconds Testing loop64 .... done, 3.804 seconds Testing no-op .... done, 1.476 seconds stewart@willster:~/src/test/memcmp$ ./a.out 168435455 64 168435455 repetitions
Testing memcmp ..... done, 9.549 seconds Testing builtin memcmp ....... done, 87.513 seconds Testing loop .... done, 41.455 seconds Testing loop32 .... done, 11.669 seconds Testing loop64 .... done, 6.368 seconds Testing no-op .... done, 1.468 seconds stewart@willster:~/src/test/memcmp$ ./a.out 168435455 128 168435455 repetitions
Testing memcmp ..... done, 15.081 seconds Testing builtin memcmp ....... done, 169.143 seconds Testing loop .... done, 86.397 seconds Testing loop32 .... done, 21.877 seconds Testing loop64 .... done, 11.445 seconds Testing no-op .... done, 1.488 seconds stewart@willster:~/src/test/memcmp$ gcc -O3 -fno-builtin memcmpbench.c stewart@willster:~/src/test/memcmp$ ./a.out 168435455 256 168435455 repetitions
Testing memcmp ..... done, 26.134 seconds Testing loop64 .... done, 21.549 seconds Testing no-op .... done, 1.500 seconds
Completely inequal values are all about the same:
stewart@willster:~/src/test/memcmp$ ./a.out 168435455 8 168435455 repetitions
Testing memcmp ..... done, 3.204 seconds Testing builtin memcmp ....... done, 13.945 seconds Testing loop .... done, 1.896 seconds Testing loop32 .... done, 2.092 seconds Testing loop64 .... done, 2.136 seconds Testing no-op .... done, 1.488 seconds
stewart@willster:~/src/test/memcmp$ ./a.out 168435455 240 168435455 repetitions
Testing memcmp ..... done, 5.520 seconds Testing builtin memcmp ....... done, 13.973 seconds Testing loop .... done, 1.912 seconds Testing loop32 .... done, 2.124 seconds Testing loop64 .... done, 2.104 seconds Testing no-op .... done, 1.468 seconds
For half the same: stewart@willster:~/src/test/memcmp$ ./a.out 168435455 8 168435455 repetitions
Testing memcmp ..... done, 7.340 seconds Testing builtin memcmp ....... done, 19.193 seconds Testing loop .... done, 4.212 seconds Testing loop32 .... done, 2.956 seconds Testing loop64 .... done, 2.112 seconds Testing no-op .... done, 1.476 seconds
stewart@willster:~/src/test/memcmp$ ./a.out 168435455 32 168435455 repetitions
Testing memcmp ..... done, 5.920 seconds Testing builtin memcmp ....... done, 34.526 seconds Testing loop .... done, 11.885 seconds Testing loop32 .... done, 4.444 seconds Testing loop64 .... done, 3.388 seconds Testing no-op .... done, 1.468 seconds
stewart@willster:~/src/test/memcmp$ ./a.out 168435455 64 168435455 repetitions
Testing memcmp ..... done, 6.948 seconds Testing builtin memcmp ....... done, 54.923 seconds Testing loop .... done, 22.081 seconds Testing loop32 .... done, 7.000 seconds Testing loop64 .... done, 4.444 seconds Testing no-op .... done, 1.488 seconds
Is concurrency an issue here and not raw single threaded performance?
Let's try with 64 threads (on the same 2 core box):
64 threads, __builtin_memcmp: read/write requests: 70000 (3184.20 per sec.) read/write requests: 70000 (3184.87 per sec.) read/write requests: 70000 (3166.83 per sec.) read/write requests: 70000 (3085.69 per sec.) AVERAGE=3155.39
64 threads, memcmp: read/write requests: 70000 (3218.07 per sec.) read/write requests: 70000 (3219.04 per sec.) read/write requests: 70000 (3222.48 per sec.) read/write requests: 70000 (3116.00 per sec.) AVERAGE=3193.89
64 threads, 32bit cmp loop: read/write requests: 70000 (3173.76 per sec.) read/write requests: 70000 (3156.23 per sec.) read/write requests: 70000 (3206.70 per sec.) read/write requests: 70000 (3247.61 per sec.) AVERAGE=3196.07
64 threads, 64bit cmp loop: read/write requests: 70000 (3210.02 per sec.) read/write requests: 70000 (3218.87 per sec.) read/write requests: 70000 (3225.98 per sec.) read/write requests: 70000 (3131.60 per sec.) AVERAGE= 3196.61
64 threads, baseline (using original ptr_compare): read/write requests: 70000 (3542.95 per sec.) read/write requests: 70000 (3556.15 per sec.) read/write requests: 70000 (3560.71 per sec.) read/write requests: 70000 (3476.80 per sec.) AVERAGE=3534.15
i.e. the old ptr_compare whoops the arse of any of the memcmp calls with higher concurrency.
So, after this I can conclude: - memcmp is nothing if not consistent across various microbenchmarks - builtin_memcmp seems to help a bit at lower concurrency, not at all at higher and is seldom useful in micro benchmark.
- the 64bit loop isn't so good at higher concurrency - the 64bit loop is favourable in microbenchmarks - memcmp is close to the 64bit loop performance (at most only 2x slower)
Unrolling loops being the key?
Testing loop .... done, 11.813 seconds Testing unrollloop .... done, 7.628 seconds Testing unrollloop32 .... done, 2.532 seconds Testing unrollloop64 .... done, 2.536 seconds Testing loop32 .... done, 4.432 seconds Testing loop64 .... done, 3.792 seconds Testing no-op .... done, 1.276 seconds
(32 and 64 do 32,64 bit compares in an unrolled loop)
Questions: - Why does my 64bit compare loop not equal performance of ptr_compare unrolled loop? (in a microbenchmark, it beats it). Concurrency again?
So what should we do?
Certainly a call to memcmp is easier to understand and keeps the code nice and simple, but we're talking up to a 10% speed hit here at higher concurrency (at least on my hardware).
I wonder what the difference is on various other hardware setups.
I'd be very interested to see what happens on sparc.
I'd also like to see it micro-benchmarked - preferably something we could repeat in future (even in ./configure ? on server startup ?)
I'm voting to keep the current ptr_compare code, and hope that somebody provides further explanation on top of what I've looked at here.
Below is the patch, followed by the code i used for microbenchmarking (note the commented out 64bit loop i used for the above loop tests):
=== modified file 'drizzled/filesort.cc' --- drizzled/filesort.cc 2009-04-28 00:17:10 +0000 +++ drizzled/filesort.cc 2009-05-08 05:59:11 +0000 @@ -1135,7 +1135,7 @@ int merge_buffers(SORTPARAM *param, IO_C } else { - cmp= get_ptr_compare(sort_length); + cmp= (qsort2_cmp)ptr_compare; first_cmp_arg= (void*) &sort_length; } priority_queue<BUFFPEK *, vector<BUFFPEK *>, compare_functor > === modified file 'mysys/mf_sort.cc' --- mysys/mf_sort.cc 2009-04-17 21:01:47 +0000 +++ mysys/mf_sort.cc 2009-05-08 05:59:11 +0000 @@ -34,7 +34,7 @@ void my_string_ptr_sort(unsigned char *b { if (size && items) { - my_qsort2(base,items, sizeof(unsigned char*), get_ptr_compare(size), + my_qsort2(base,items, sizeof(unsigned char*), (qsort2_cmp)ptr_compare, (void*) &size); } }
=== modified file 'mysys/my_sys.h' --- mysys/my_sys.h 2009-04-27 22:05:43 +0000 +++ mysys/my_sys.h 2009-05-08 05:59:11 +0000 @@ -420,7 +420,12 @@ extern void my_qsort(void *base_ptr, siz qsort_cmp cmp); extern void my_qsort2(void *base_ptr, size_t total_elems, size_t size, qsort2_cmp cmp, void *cmp_argument); -extern qsort2_cmp get_ptr_compare(size_t); + +#if defined(__cplusplus) +extern "C" +#endif +int ptr_compare(size_t *compare_length, unsigned char **a, unsigned char **b); + void my_store_ptr(unsigned char *buff, size_t pack_length, my_off_t pos); my_off_t my_get_ptr(unsigned char *ptr, size_t pack_length); File create_temp_file(char *to, const char *dir, const char *pfx,
=== modified file 'mysys/ptr_cmp.cc' --- mysys/ptr_cmp.cc 2009-04-26 16:53:32 +0000 +++ mysys/ptr_cmp.cc 2009-05-08 05:59:11 +0000 @@ -21,137 +21,33 @@
#include "mysys/mysys_priv.h" #include "plugin/myisam/myisampack.h" - -static int ptr_compare(size_t *compare_length, unsigned char **a, unsigned char **b); -static int ptr_compare_0(size_t *compare_length, unsigned char **a, unsigned char **b); -static int ptr_compare_1(size_t *compare_length, unsigned char **a, unsigned char **b); -static int ptr_compare_2(size_t *compare_length, unsigned char **a, unsigned char **b); -static int ptr_compare_3(size_t *compare_length, unsigned char **a, unsigned char **b); - - /* Get a pointer to a optimal byte-compare function for a given size */ - -qsort2_cmp get_ptr_compare (size_t size) -{ - if (size < 4) - return (qsort2_cmp) ptr_compare; - switch (size & 3) { - case 0: return (qsort2_cmp) ptr_compare_0; - case 1: return (qsort2_cmp) ptr_compare_1; - case 2: return (qsort2_cmp) ptr_compare_2; - case 3: return (qsort2_cmp) ptr_compare_3; - } - return 0; /* Impossible */ -} - +#include <string.h>
/* Compare to keys to see witch is smaller. - Loop unrolled to make it quick !! */
-#define cmp(N) if (first[N] != last[N]) return (int) first[N] - (int) last[N] - -static int ptr_compare(size_t *compare_length, unsigned char **a, unsigned char **b) -{ - register int length= *compare_length; - register unsigned char *first,*last; - - first= *a; last= *b; - while (--length) - { - if (*first++ != *last++) - return (int) first[-1] - (int) last[-1]; - } - return (int) first[0] - (int) last[0]; -} - - -static int ptr_compare_0(size_t *compare_length,unsigned char **a, unsigned char **b) +extern "C" +int ptr_compare(size_t *compare_length, unsigned char **a, unsigned char **b) { - register int length= *compare_length; - register unsigned char *first,*last; - - first= *a; last= *b; - loop: - cmp(0); - cmp(1); - cmp(2); - cmp(3); - if ((length-=4)) +/* size_t l= *compare_length / sizeof(uint64_t); + uint64_t *aa= (uint64_t*)*a; + uint64_t *bb= (uint64_t*)*b; + while(l--) { - first+=4; - last+=4; - goto loop; - } - return (0); -} - - -static int ptr_compare_1(size_t *compare_length,unsigned char **a, unsigned char **b) -{ - register int length= *compare_length-1; - register unsigned char *first,*last; - - first= *a+1; last= *b+1; - cmp(-1); - loop: - cmp(0); - cmp(1); - cmp(2); - cmp(3); - if ((length-=4)) - { - first+=4; - last+=4; - goto loop; - } - return (0); -} - -static int ptr_compare_2(size_t *compare_length,unsigned char **a, unsigned char **b) -{ - register int length= *compare_length-2; - register unsigned char *first,*last; - - first= *a +2 ; last= *b +2; - cmp(-2); - cmp(-1); - loop: - cmp(0); - cmp(1); - cmp(2); - cmp(3); - if ((length-=4)) - { - first+=4; - last+=4; - goto loop; + if(*aa != *bb) + { + if(*aa < *bb) + return -1; + else + return 1; + } + aa++, bb++; } - return (0); + return 0; +*/ return memcmp(*a, *b, *compare_length); }
-static int ptr_compare_3(size_t *compare_length,unsigned char **a, unsigned char **b) -{ - register int length= *compare_length-3; - register unsigned char *first,*last; - - first= *a +3 ; last= *b +3; - cmp(-3); - cmp(-2); - cmp(-1); - loop: - cmp(0); - cmp(1); - cmp(2); - cmp(3); - if ((length-=4)) - { - first+=4; - last+=4; - goto loop; - } - return (0); -}
void my_store_ptr(unsigned char *buff, size_t pack_length, my_off_t pos) {
For those interested, i started with the code at http://gcc.gnu.org/ml/gcc/2002-10/msg01666.html
and ended up with something like this:
#include <sys/resource.h> #include <sys/time.h> #include <stdio.h> #include <assert.h> #include <stdlib.h>
char s1[256]; char s2[256];
int cmpsize= 240;
int speed_memcmp () { return memcmp (s1, s2, cmpsize); }
int speed_bimemcmp () { return __builtin_memcmp (s1, s2, cmpsize); }
int speed_loop () { int i=cmpsize; char *a= s1; char *b= s2; while(i--) { if(*a != *b) return -1; a++, b++; }; return 0; }
int speed_loop32 () { int i=cmpsize/4; int *a= (int*)s1; int *b= (int*)s2; while(i--) { if(*a != *b) return -1; a++, b++; }; return 0; }
int speed_loop64 () { int i=cmpsize/8; long long *a= (long long*)s1; long long *b= (long long*)s2; while(i--) { if(*a != *b) return -1; a++, b++; };
return 0; }
int speed_null() { return 0; }
double do_test (int repetitions, int (*test_function) ()) { struct rusage r1, r2;
getrusage(RUSAGE_SELF, &r1);
while (repetitions--) test_function ();
getrusage(RUSAGE_SELF, &r2); return (r2.ru_utime.tv_sec - r1.ru_utime.tv_sec) + (r2.ru_utime.tv_usec - r1.ru_utime.tv_usec) / 1000000.0; }
main(int argc, char **argv) { int repetitions = (argc == 1) ? 0x0fffffff : atoi(argv[1]); cmpsize = (argc == 2) ? 240 : atoi(argv[2]);
memset(s1, 42, 256); memset(s2, 42, 256); memset(s2+(cmpsize/2), 55, cmpsize/2);
printf ("%d repetitions\n\n", repetitions); printf ("Testing memcmp ....."); fflush (stdout); printf (" done, %10.3f seconds\n", do_test (repetitions, speed_memcmp));
printf ("Testing builtin memcmp ......."); fflush (stdout); printf (" done, %10.3f seconds\n", do_test (repetitions, speed_bimemcmp));
printf ("Testing loop ...."); fflush (stdout); printf (" done, %10.3f seconds\n", do_test (repetitions, speed_loop));
printf ("Testing loop32 ...."); fflush (stdout); printf (" done, %10.3f seconds\n", do_test (repetitions, speed_loop32));
printf ("Testing loop64 ...."); fflush (stdout); printf (" done, %10.3f seconds\n", do_test (repetitions, speed_loop64));
printf ("Testing no-op ...."); fflush (stdout); printf (" done, %10.3f seconds\n", do_test (repetitions, speed_null));
exit (0); }