From dean-list-lmbench-users@arctic.org Mon Apr 26 20:01:19 2004
From: dean gaudet
To: lmbench-users@bitmover.com
Subject: [Lmbench-users] [patch] lat_ops fp division results misleading
Date: Mon, 26 Apr 2004 19:57:46 -0700 (PDT)

it's not uncommon for hardware to have shortcircuits for various special
division cases... division by 1 being one of the most obvious.

the quick fix below uses 3.14159 in several cases... but i didn't do a
full analysis to make sure it hasn't caused over/underflow.

here are before / after timings in nanoseconds on two processors which i
know to have the fdiv-by-1 shortcircuit, and one which doesn't:

                      pentium-m 1GHz   k8 1.8GHz       p4-2 2.4GHz
float div             23.78 / 38.47    9.78 / 13.38    17.92 / 17.96
double div            24.59 / 38.49    9.77 / 13.39    17.92 / 17.95
float bogomflops:      8.33 / 37.55    5.73 / 13.61    17.97 / 17.96
double bogomflops:     8.55 / 37.90    5.06 / 12.04    17.97 / 17.96

the new results correspond exactly with latencies i've measured for x87
80-bit precision divisions... (all x87 divisions are at fcw.pc precision
and you need to change this global setting to see shorter latencies for
32-bit or 64-bit operations -- linux defaults to 80-bit, windows to
64-bit).

if i recompile with -msse2 -mfpmath=sse then i get numbers corresponding
to sse/sse2 latencies i've measured through other means.

note the bogomflops benchmark isn't all that useful on its own -- the
division in there basically means that the benchmark proves whether or
not the hardware is capable of overlapping an fp division with other
operations. this is definitely interesting information -- as you can
see above these processors are both capable of overlapping division
completely (or nearly completely) with other operations, and the
division is the dominating cost.

but a critical benchmark is missing -- fp muladd. in general it's very
interesting to know how well a processor does on a balanced sequence of
fp multiplications and adds (i.e. think polynomial expansion/approximation,
matrix multiply, dot product, ...) (the processors above have quite
different capabilities for pairing x87, sse, and sse2 muls and adds).

-dean

--- lat_ops.c.orig	2003-01-13 03:16:13.000000000 -0800
+++ lat_ops.c	2004-04-26 19:21:45.000000000 -0700
@@ -187,7 +187,7 @@
 do_float_div(iter_t iterations, void* cookie)
 {
 	struct _state *pState = (struct _state*)cookie;
-	register float f = (float)pState->N;
+	register float f = 3.14159*(float)pState->N;
 	register float g = (float)pState->M;
 
 	while (iterations-- > 0) {
@@ -240,7 +240,7 @@
 do_double_div(iter_t iterations, void* cookie)
 {
 	struct _state *pState = (struct _state*)cookie;
-	register double f = (double)pState->N;
+	register double f = 3.14159*(double)pState->N;
 	register double g = (double)pState->M;
 
 	while (iterations-- > 0) {
@@ -264,7 +264,7 @@
 
 	pState->data = (double*)x;
 	for (i = 0; i < pState->M; ++i) {
-		x[i] = 1.;
+		x[i] = 3.14159;
 	}
 }
 
@@ -276,7 +276,7 @@
 
 	pState->data = (double*)malloc(pState->M * sizeof(double));
 	for (i = 0; i < pState->M; ++i) {
-		pState->data[i] = 1.;
+		pState->data[i] = 3.14159;
 	}
 }
 
_______________________________________________
Lmbench-users mailing list
Lmbench-users@bitmover.com
http://bitmover.com/mailman/listinfo/lmbench-users
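
the by-1 effect described in the message can be checked outside lmbench
with a small standalone sketch like the one below. it is not taken from
lat_ops.c: the helper names div_chain and ns_per_div are hypothetical,
and the f = g / f chain is an assumed loop shape, chosen because the
value keeps bouncing between roughly f and g / f and so stays away from
overflow and underflow. timing uses the standard clock() call, which is
coarse, so the iteration count needs to be large.

```c
#include <assert.h>
#include <time.h>

/* hypothetical helper, not from lat_ops.c: a dependent chain of
 * divisions.  each result is the divisor of the next division, so the
 * loop measures latency rather than throughput.  the value alternates
 * between roughly f and g / f, so it stays bounded. */
double div_chain(double f, double g, long iterations)
{
	while (iterations-- > 0) {
		f = g / f;
	}
	return f;
}

/* hypothetical helper: average nanoseconds per division in the chain.
 * the operands pass through volatile variables so the compiler cannot
 * fold the divisions away when called with constants. */
double ns_per_div(double f, double g, long iterations)
{
	volatile double vf = f;
	volatile double vg = g;
	volatile double sink;
	clock_t start, end;

	assert(iterations > 0);
	start = clock();
	sink = div_chain(vf, vg, iterations);
	end = clock();
	(void)sink;
	return 1e9 * (double)(end - start) / CLOCKS_PER_SEC
	    / (double)iterations;
}
```

calling ns_per_div(1.0, 1.0, n) and ns_per_div(3.14159, 1.0, n) with the
same large n gives the two cases to compare: in the first every divisor
is 1, in the second none is. on hardware with the shortcircuit the first
call should come out faster; on hardware without it the two should be
about the same.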