Re: [Maria-developers] [Drizzle-discuss] ptr_compare replace with memcmp and associated performance

10 May 2009

One of the dangers of microbenchmarks is that they do not consider cache 
misses well. Unrolling one loop is no harm, but what happens when you 
have hundreds of unrolled loops in your program? I would imagine that 
the probability of cache misses increase greatly, perhaps so much that 
it regresses performance.

Is it possible to control loop unrolling with a pragma, so that only the 
most performance sensitive loops get unrolled?

Thanks,
Roy

Jay Pipes wrote:
...
Have we tried running the benchmarks with and without -funroll-loops?
-j
Stewart Smith wrote:
...
So i broke out mtaylor's patch that we were bumming around with at the UC
to replace ptr_compare with a simple memcmp call.
At the UC I benched that this patch actually caused a measurable 
performance
regresssion.
So what's the difference?
(same benchmark and machine as in previous mail)
MAX_FIELDS=64 with std::bitset
    read/write requests:                 70000  (3374.50 per sec.)
    read/write requests:                 70000  (3102.42 per sec.)
    read/write requests:                 70000  (3113.47 per sec.)
    read/write requests:                 70000  (3401.89 per sec.)
    read/write requests:                 70000  (3164.61 per sec.)
AVERAGE= 3231.37
With ptr_compare replaced with memcmp:
    read/write requests:                 70000  (3090.33 per sec.)
    read/write requests:                 70000  (3066.97 per sec.)
    read/write requests:                 70000  (2954.93 per sec.)
    read/write requests:                 70000  (2875.57 per sec.)
    read/write requests:                 70000  (2953.99 per sec.)
AVERAGE= 2988.35
With ptr_compare replaced with __builtin_memcmp:
    read/write requests:                 70000  (3245.37 per sec.)
    read/write requests:                 70000  (3001.91 per sec.)
    read/write requests:                 70000  (3254.99 per sec.)
    read/write requests:                 70000  (3177.13 per sec.)
    read/write requests:                 70000  (3195.49 per sec.)
AVERAGE= 3174.97
So, I  thought that with just using the __builtin_memcmp 
automatically, we
may be okay with merging this.
However... I do have the following concerns:
1) I'm pretty sure that __builtin_memcmp is gcc only, and that SunStudio
may have an issue (yay - wrappers!)
2) Is this perf difference going to be true on different platforms?
3) the various memcmp implementations do seem to be dependent on a few
things for performance: alignment, size of data.
For what we're seeing in sysbench, the parameters are things like this:
"SELECT c from sbtest where id between 9942 and 10041 order by c"
0xc30780,0xc2c340,240
0xc2e560,0xc2c340,240
0xc30780,0xc2e560,240
0xc37390,0xc32f50,240
0xc35170,0xc32f50,240
0xc37390,0xc35170,240
0xc3dcc8,0xc39888,240
0xc3baa8,0xc2e560,240
0xc35170,0xc2e560,240
0xc3baa8,0xc35170,240
0xc2c340,0xc35170,240
0xc35170,0xc3dcc8,240
0xc2c618,0xc35170,240
0xc35170,0xc3d9f0,240
0xc2c8f0,0xc35170,240
0x7fbed85f10f0,0x7fbed85eccb0,240
and from the CREATE TABLE:
        `c` VARCHAR(120) NOT NULL  COLLATE utf8_general_ci DEFAULT `` ,
So here it's large, and aligned.
For some other micro-benchmarks I've clocked things looking much
different.... so it's possibly query dependent as to what ends up being
better (i.e. how much data we're comparing).
(1st parameter is number of repetitions, 2nd is number of bytes to 
memcmp)
For comparing equal values:
stewart@willster:~/src/test/memcmp$ ./a.out 168435455 8
168435455 repetitions
Testing memcmp ..... done,      7.432 seconds
Testing builtin memcmp ....... done,     16.313 seconds
Testing loop .... done,      6.144 seconds
Testing loop32 .... done,      2.764 seconds
Testing loop64 .... done,      2.128 seconds
Testing no-op .... done,      1.488 seconds
stewart@willster:~/src/test/memcmp$ ./a.out 168435455 16
168435455 repetitions
Testing memcmp ..... done,      5.912 seconds
Testing builtin memcmp ....... done,     26.418 seconds
Testing loop .... done,     11.201 seconds
Testing loop32 .... done,      3.980 seconds
Testing loop64 .... done,      2.564 seconds
Testing no-op .... done,      1.492 seconds
stewart@willster:~/src/test/memcmp$ ./a.out 168435455 32
168435455 repetitions
Testing memcmp ..... done,      6.828 seconds
Testing builtin memcmp ....... done,     46.891 seconds
Testing loop .... done,     21.473 seconds
Testing loop32 .... done,      6.536 seconds
Testing loop64 .... done,      3.804 seconds
Testing no-op .... done,      1.476 seconds
stewart@willster:~/src/test/memcmp$ ./a.out 168435455 64
168435455 repetitions
Testing memcmp ..... done,      9.549 seconds
Testing builtin memcmp ....... done,     87.513 seconds
Testing loop .... done,     41.455 seconds
Testing loop32 .... done,     11.669 seconds
Testing loop64 .... done,      6.368 seconds
Testing no-op .... done,      1.468 seconds
stewart@willster:~/src/test/memcmp$ ./a.out 168435455 128
168435455 repetitions
Testing memcmp ..... done,     15.081 seconds
Testing builtin memcmp ....... done,    169.143 seconds
Testing loop .... done,     86.397 seconds
Testing loop32 .... done,     21.877 seconds
Testing loop64 .... done,     11.445 seconds
Testing no-op .... done,      1.488 seconds
stewart@willster:~/src/test/memcmp$ gcc -O3 -fno-builtin memcmpbench.c
stewart@willster:~/src/test/memcmp$ ./a.out 168435455 256
168435455 repetitions
Testing memcmp ..... done,     26.134 seconds
Testing loop64 .... done,     21.549 seconds
Testing no-op .... done,      1.500 seconds
Completely inequal values are all about the same:
stewart@willster:~/src/test/memcmp$ ./a.out 168435455 8
168435455 repetitions
Testing memcmp ..... done,      3.204 seconds
Testing builtin memcmp ....... done,     13.945 seconds
Testing loop .... done,      1.896 seconds
Testing loop32 .... done,      2.092 seconds
Testing loop64 .... done,      2.136 seconds
Testing no-op .... done,      1.488 seconds
stewart@willster:~/src/test/memcmp$ ./a.out 168435455 240
168435455 repetitions
Testing memcmp ..... done,      5.520 seconds
Testing builtin memcmp ....... done,     13.973 seconds
Testing loop .... done,      1.912 seconds
Testing loop32 .... done,      2.124 seconds
Testing loop64 .... done,      2.104 seconds
Testing no-op .... done,      1.468 seconds
For half the same:
stewart@willster:~/src/test/memcmp$ ./a.out 168435455 8
168435455 repetitions
Testing memcmp ..... done,      7.340 seconds
Testing builtin memcmp ....... done,     19.193 seconds
Testing loop .... done,      4.212 seconds
Testing loop32 .... done,      2.956 seconds
Testing loop64 .... done,      2.112 seconds
Testing no-op .... done,      1.476 seconds
stewart@willster:~/src/test/memcmp$ ./a.out 168435455 32
168435455 repetitions
Testing memcmp ..... done,      5.920 seconds
Testing builtin memcmp ....... done,     34.526 seconds
Testing loop .... done,     11.885 seconds
Testing loop32 .... done,      4.444 seconds
Testing loop64 .... done,      3.388 seconds
Testing no-op .... done,      1.468 seconds
stewart@willster:~/src/test/memcmp$ ./a.out 168435455 64
168435455 repetitions
Testing memcmp ..... done,      6.948 seconds
Testing builtin memcmp ....... done,     54.923 seconds
Testing loop .... done,     22.081 seconds
Testing loop32 .... done,      7.000 seconds
Testing loop64 .... done,      4.444 seconds
Testing no-op .... done,      1.488 seconds
Is concurrency an issue here and not raw single threaded performance?
Let's try with 64 threads (on the same 2 core box):
64 threads, __builtin_memcmp:
    read/write requests:                 70000  (3184.20 per sec.)
    read/write requests:                 70000  (3184.87 per sec.)
    read/write requests:                 70000  (3166.83 per sec.)
    read/write requests:                 70000  (3085.69 per sec.)
AVERAGE=3155.39
64 threads, memcmp:
    read/write requests:                 70000  (3218.07 per sec.)
    read/write requests:                 70000  (3219.04 per sec.)
    read/write requests:                 70000  (3222.48 per sec.)
    read/write requests:                 70000  (3116.00 per sec.)
AVERAGE=3193.89
64 threads, 32bit cmp loop:
    read/write requests:                 70000  (3173.76 per sec.)
    read/write requests:                 70000  (3156.23 per sec.)
    read/write requests:                 70000  (3206.70 per sec.)
    read/write requests:                 70000  (3247.61 per sec.)
AVERAGE=3196.07
64 threads, 64bit cmp loop:
    read/write requests:                 70000  (3210.02 per sec.)
    read/write requests:                 70000  (3218.87 per sec.)
    read/write requests:                 70000  (3225.98 per sec.)
    read/write requests:                 70000  (3131.60 per sec.)
AVERAGE= 3196.61
64 threads, baseline (using original ptr_compare):
    read/write requests:                 70000  (3542.95 per sec.)
    read/write requests:                 70000  (3556.15 per sec.)
    read/write requests:                 70000  (3560.71 per sec.)
    read/write requests:                 70000  (3476.80 per sec.)
AVERAGE=3534.15
i.e. the old ptr_compare whoops the arse of any of the memcmp calls with
higher concurrency.
So, after this I can conclude:
- memcmp is nothing if not consistent across various microbenchmarks
- builtin_memcmp seems to help a bit at lower concurrency, not at all at
higher and is seldom useful in micro benchmark.
- the 64bit loop isn't so good at higher concurrency
- the 64bit loop is favourable in microbenchmarks
- memcmp is close to the 64bit loop performance (at most only 2x slower)
Unrolling loops being the key?
Testing loop .... done,     11.813 seconds
Testing unrollloop .... done,      7.628 seconds
Testing unrollloop32 .... done,      2.532 seconds
Testing unrollloop64 .... done,      2.536 seconds
Testing loop32 .... done,      4.432 seconds
Testing loop64 .... done,      3.792 seconds
Testing no-op .... done,      1.276 seconds
(32 and 64 do 32,64 bit compares in an unrolled loop)
Questions:
- Why does my 64bit compare loop not equal performance of ptr_compare
unrolled loop? (in a microbenchmark, it beats it). Concurrency again?
So what should we do?
Certainly a  call to memcmp is easier to understand and keeps the code
nice and simple, but we're talking up to a 10% speed hit here at higher
concurrency (at least on my hardware).
I wonder what the difference is on various other hardware setups.
I'd be very interested to see what happens on sparc.
I'd also like to see it micro-benchmarked - preferably something we
could repeat in future (even in ./configure ? on server startup ?)
I'm voting to keep the current ptr_compare code, and hope that somebody
provides further explanation on top of what I've looked at here.
Below is the patch, followed by the code i used for microbenchmarking
(note the commented out 64bit loop i used for the above loop tests):
=== modified file 'drizzled/filesort.cc'

--- drizzled/filesort.cc    2009-04-28 00:17:10 +0000
+++ drizzled/filesort.cc    2009-05-08 05:59:11 +0000
@@ -1135,7 +1135,7 @@ int merge_buffers(SORTPARAM *param, IO_C
   }
   else
   {
-    cmp= get_ptr_compare(sort_length);
+    cmp= (qsort2_cmp)ptr_compare;
     first_cmp_arg= (void*) &sort_length;
   }
   priority_queue<BUFFPEK *, vector<BUFFPEK *>, compare_functor >
=== modified file 'mysys/mf_sort.cc'
--- mysys/mf_sort.cc    2009-04-17 21:01:47 +0000
+++ mysys/mf_sort.cc    2009-05-08 05:59:11 +0000
@@ -34,7 +34,7 @@ void my_string_ptr_sort(unsigned char *b
   {
     if (size && items)
     {
-      my_qsort2(base,items, sizeof(unsigned char*), 
get_ptr_compare(size),
+      my_qsort2(base,items, sizeof(unsigned char*), 
(qsort2_cmp)ptr_compare,
                 (void*) &size);
     }
   }
=== modified file 'mysys/my_sys.h'
--- mysys/my_sys.h    2009-04-27 22:05:43 +0000
+++ mysys/my_sys.h    2009-05-08 05:59:11 +0000
@@ -420,7 +420,12 @@ extern void my_qsort(void *base_ptr, siz
                      qsort_cmp cmp);
 extern void my_qsort2(void *base_ptr, size_t total_elems, size_t size,
                       qsort2_cmp cmp, void *cmp_argument);
-extern qsort2_cmp get_ptr_compare(size_t);
+
+#if defined(__cplusplus)
+extern "C"
+#endif
+int ptr_compare(size_t *compare_length, unsigned char **a, unsigned 
char **b);
+
 void my_store_ptr(unsigned char *buff, size_t pack_length, my_off_t 
pos);
 my_off_t my_get_ptr(unsigned char *ptr, size_t pack_length);
 File create_temp_file(char *to, const char *dir, const char *pfx,
=== modified file 'mysys/ptr_cmp.cc'
--- mysys/ptr_cmp.cc    2009-04-26 16:53:32 +0000
+++ mysys/ptr_cmp.cc    2009-05-08 05:59:11 +0000
@@ -21,137 +21,33 @@
#include "mysys/mysys_priv.h"
 #include "plugin/myisam/myisampack.h"
-
-static int ptr_compare(size_t *compare_length, unsigned char **a, 
unsigned char **b);
-static int ptr_compare_0(size_t *compare_length, unsigned char **a, 
unsigned char **b);
-static int ptr_compare_1(size_t *compare_length, unsigned char **a, 
unsigned char **b);
-static int ptr_compare_2(size_t *compare_length, unsigned char **a, 
unsigned char **b);
-static int ptr_compare_3(size_t *compare_length, unsigned char **a, 
unsigned char **b);
-
-    /* Get a pointer to a optimal byte-compare function for a given 
size */
-
-qsort2_cmp get_ptr_compare (size_t size)
-{
-  if (size < 4)
-    return (qsort2_cmp) ptr_compare;
-  switch (size & 3) {
-    case 0: return (qsort2_cmp) ptr_compare_0;
-    case 1: return (qsort2_cmp) ptr_compare_1;
-    case 2: return (qsort2_cmp) ptr_compare_2;
-    case 3: return (qsort2_cmp) ptr_compare_3;
-    }
-  return 0;                    /* Impossible */
-}
-
+#include <string.h>
/*
       Compare to keys to see witch is smaller.
-      Loop unrolled to make it quick !!
     */
-#define cmp(N) if (first[N] != last[N]) return (int) first[N] - (int) 
last[N]
-
-static int ptr_compare(size_t *compare_length, unsigned char **a, 
unsigned char **b)
-{
-  register int length= *compare_length;
-  register unsigned char *first,*last;
-
-  first= *a; last= *b;
-  while (--length)
-  {
-    if (*first++ != *last++)
-      return (int) first[-1] - (int) last[-1];
-  }
-  return (int) first[0] - (int) last[0];
-}
-
-
-static int ptr_compare_0(size_t *compare_length,unsigned char **a, 
unsigned char **b)
+extern "C"
+int ptr_compare(size_t *compare_length, unsigned char **a, unsigned 
char **b)
 {
-  register int length= *compare_length;
-  register unsigned char *first,*last;
-
-  first= *a; last= *b;
- loop:
-  cmp(0);
-  cmp(1);
-  cmp(2);
-  cmp(3);
-  if ((length-=4))
+/*  size_t l= *compare_length / sizeof(uint64_t);
+  uint64_t *aa= (uint64_t*)*a;
+  uint64_t *bb= (uint64_t*)*b;
+  while(l--)
   {
-    first+=4;
-    last+=4;
-    goto loop;
-  }
-  return (0);
-}
-
-
-static int ptr_compare_1(size_t *compare_length,unsigned char **a, 
unsigned char **b)
-{
-  register int length= *compare_length-1;
-  register unsigned char *first,*last;
-
-  first= *a+1; last= *b+1;
-  cmp(-1);
- loop:
-  cmp(0);
-  cmp(1);
-  cmp(2);
-  cmp(3);
-  if ((length-=4))
-  {
-    first+=4;
-    last+=4;
-    goto loop;
-  }
-  return (0);
-}
-
-static int ptr_compare_2(size_t *compare_length,unsigned char **a, 
unsigned char **b)
-{
-  register int length= *compare_length-2;
-  register unsigned char *first,*last;
-
-  first= *a +2 ; last= *b +2;
-  cmp(-2);
-  cmp(-1);
- loop:
-  cmp(0);
-  cmp(1);
-  cmp(2);
-  cmp(3);
-  if ((length-=4))
-  {
-    first+=4;
-    last+=4;
-    goto loop;
+    if(*aa != *bb)
+    {
+      if(*aa < *bb)
+    return -1;
+      else
+    return 1;
+    }
+    aa++, bb++;
   }
-  return (0);
+  return 0;
+*/  return memcmp(*a, *b, *compare_length);
 }
-static int ptr_compare_3(size_t *compare_length,unsigned char **a, 
unsigned char **b)
-{
-  register int length= *compare_length-3;
-  register unsigned char *first,*last;
-
-  first= *a +3 ; last= *b +3;
-  cmp(-3);
-  cmp(-2);
-  cmp(-1);
- loop:
-  cmp(0);
-  cmp(1);
-  cmp(2);
-  cmp(3);
-  if ((length-=4))
-  {
-    first+=4;
-    last+=4;
-    goto loop;
-  }
-  return (0);
-}
void my_store_ptr(unsigned char *buff, size_t pack_length, my_off_t pos)
 {
For those interested, i started with the code at
http://gcc.gnu.org/ml/gcc/2002-10/msg01666.html
and ended up with something like this:
#include <sys/resource.h>
#include <sys/time.h>
#include <stdio.h>
#include <assert.h>
#include <stdlib.h>
char s1[256];
char s2[256];
int cmpsize= 240;
int
speed_memcmp ()
{
  return memcmp (s1, s2, cmpsize);
}
int
speed_bimemcmp ()
{
  return __builtin_memcmp (s1, s2, cmpsize);
}
int
speed_loop ()
{
  int i=cmpsize;
  char *a= s1;
  char *b= s2;
  while(i--)
  {
    if(*a != *b)
      return -1;
    a++, b++;
  };
  return 0;
}
int
speed_loop32 ()
{
  int i=cmpsize/4;
  int *a= (int*)s1;
  int *b= (int*)s2;
  while(i--)
  {
    if(*a != *b)
      return -1;
    a++, b++;
  };
  return 0;
}
int
speed_loop64 ()
{
  int i=cmpsize/8;
  long long *a= (long long*)s1;
  long long *b= (long long*)s2;
  while(i--)
  {
    if(*a != *b)
      return -1;
    a++, b++;
  };
return 0;
}
int speed_null()
{
  return 0;
}
double
do_test (int repetitions, int (*test_function) ())
{
  struct rusage r1, r2;
getrusage(RUSAGE_SELF, &r1);
while (repetitions--)
    test_function ();
getrusage(RUSAGE_SELF, &r2);
  return (r2.ru_utime.tv_sec - r1.ru_utime.tv_sec) +
     (r2.ru_utime.tv_usec - r1.ru_utime.tv_usec) / 1000000.0;
}
main(int argc, char **argv)
{
  int repetitions = (argc == 1) ? 0x0fffffff : atoi(argv[1]);
  cmpsize = (argc == 2) ? 240 : atoi(argv[2]);
memset(s1, 42, 256);
  memset(s2, 42, 256);
  memset(s2+(cmpsize/2), 55, cmpsize/2);
printf ("%d repetitions\n\n", repetitions);
  printf ("Testing memcmp .....");
  fflush (stdout);
  printf (" done, %10.3f seconds\n", do_test (repetitions, 
speed_memcmp));
printf ("Testing builtin memcmp .......");
  fflush (stdout);
  printf (" done, %10.3f seconds\n", do_test (repetitions, 
speed_bimemcmp));
printf ("Testing loop ....");
  fflush (stdout);
  printf (" done, %10.3f seconds\n", do_test (repetitions, speed_loop));
printf ("Testing loop32 ....");
  fflush (stdout);
  printf (" done, %10.3f seconds\n", do_test (repetitions, 
speed_loop32));
printf ("Testing loop64 ....");
  fflush (stdout);
  printf (" done, %10.3f seconds\n", do_test (repetitions, 
speed_loop64));
printf ("Testing no-op ....");
  fflush (stdout);
  printf (" done, %10.3f seconds\n", do_test (repetitions, speed_null));
exit (0);
}


    

Re: [Maria-developers] [Drizzle-discuss] ptr_compare replace with memcmp and associated performance

Roy Lyseng