assembly - Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?

ID : 365

viewed : 154

Tags : gccassemblyfloating-pointcompiler-optimizationfast-mathgcc





Top 5 Answer for assembly - Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?

vote vote

94

Because Floating Point Math is not Associative. The way you group the operands in floating point multiplication has an effect on the numerical accuracy of the answer.

As a result, most compilers are very conservative about reordering floating point calculations unless they can be sure that the answer will stay the same, or unless you tell them you don't care about numerical accuracy. For example: the -fassociative-math option of gcc which allows gcc to reassociate floating point operations, or even the -ffast-math option which allows even more aggressive tradeoffs of accuracy against speed.

vote vote

80

Lambdageek correctly points out that because associativity does not hold for floating-point numbers, the "optimization" of a*a*a*a*a*a to (a*a*a)*(a*a*a) may change the value. This is why it is disallowed by C99 (unless specifically allowed by the user, via compiler flag or pragma). Generally, the assumption is that the programmer wrote what she did for a reason, and the compiler should respect that. If you want (a*a*a)*(a*a*a), write that.

That can be a pain to write, though; why can't the compiler just do [what you consider to be] the right thing when you use pow(a,6)? Because it would be the wrong thing to do. On a platform with a good math library, pow(a,6) is significantly more accurate than either a*a*a*a*a*a or (a*a*a)*(a*a*a). Just to provide some data, I ran a small experiment on my Mac Pro, measuring the worst error in evaluating a^6 for all single-precision floating numbers between [1,2):

worst relative error using    powf(a, 6.f): 5.96e-08 worst relative error using (a*a*a)*(a*a*a): 2.94e-07 worst relative error using     a*a*a*a*a*a: 2.58e-07 

Using pow instead of a multiplication tree reduces the error bound by a factor of 4. Compilers should not (and generally do not) make "optimizations" that increase error unless licensed to do so by the user (e.g. via -ffast-math).

Note that GCC provides __builtin_powi(x,n) as an alternative to pow( ), which should generate an inline multiplication tree. Use that if you want to trade off accuracy for performance, but do not want to enable fast-math.

vote vote

78

Another similar case: most compilers won't optimize a + b + c + d to (a + b) + (c + d) (this is an optimization since the second expression can be pipelined better) and evaluate it as given (i.e. as (((a + b) + c) + d)). This too is because of corner cases:

float a = 1e35, b = 1e-5, c = -1e35, d = 1e-5; printf("%e %e\n", a + b + c + d, (a + b) + (c + d)); 

This outputs 1.000000e-05 0.000000e+00

vote vote

69

Fortran (designed for scientific computing) has a built-in power operator, and as far as I know Fortran compilers will commonly optimize raising to integer powers in a similar fashion to what you describe. C/C++ unfortunately don't have a power operator, only the library function pow(). This doesn't prevent smart compilers from treating pow specially and computing it in a faster way for special cases, but it seems they do it less commonly ...

Some years ago I was trying to make it more convenient to calculate integer powers in an optimal way, and came up with the following. It's C++, not C though, and still depends on the compiler being somewhat smart about how to optimize/inline things. Anyway, hope you might find it useful in practice:

template<unsigned N> struct power_impl;  template<unsigned N> struct power_impl {     template<typename T>     static T calc(const T &x) {         if (N%2 == 0)             return power_impl<N/2>::calc(x*x);         else if (N%3 == 0)             return power_impl<N/3>::calc(x*x*x);         return power_impl<N-1>::calc(x)*x;     } };  template<> struct power_impl<0> {     template<typename T>     static T calc(const T &) { return 1; } };  template<unsigned N, typename T> inline T power(const T &x) {     return power_impl<N>::calc(x); } 

Clarification for the curious: this does not find the optimal way to compute powers, but since finding the optimal solution is an NP-complete problem and this is only worth doing for small powers anyway (as opposed to using pow), there's no reason to fuss with the detail.

Then just use it as power<6>(a).

This makes it easy to type powers (no need to spell out 6 as with parens), and lets you have this kind of optimization without -ffast-math in case you have something precision dependent such as compensated summation (an example where the order of operations is essential).

You can probably also forget that this is C++ and just use it in the C program (if it compiles with a C++ compiler).

Hope this can be useful.

EDIT:

This is what I get from my compiler:

For a*a*a*a*a*a,

    movapd  %xmm1, %xmm0     mulsd   %xmm1, %xmm0     mulsd   %xmm1, %xmm0     mulsd   %xmm1, %xmm0     mulsd   %xmm1, %xmm0     mulsd   %xmm1, %xmm0 

For (a*a*a)*(a*a*a),

    movapd  %xmm1, %xmm0     mulsd   %xmm1, %xmm0     mulsd   %xmm1, %xmm0     mulsd   %xmm0, %xmm0 

For power<6>(a),

    mulsd   %xmm0, %xmm0     movapd  %xmm0, %xmm1     mulsd   %xmm0, %xmm1     mulsd   %xmm0, %xmm1 
vote vote

50

GCC does actually optimize a*a*a*a*a*a to (a*a*a)*(a*a*a) when a is an integer. I tried with this command:

$ echo 'int f(int x) { return x*x*x*x*x*x; }' | gcc -o - -O2 -S -masm=intel -x c - 

There are a lot of gcc flags but nothing fancy. They mean: Read from stdin; use O2 optimization level; output assembly language listing instead of a binary; the listing should use Intel assembly language syntax; the input is in C language (usually language is inferred from input file extension, but there is no file extension when reading from stdin); and write to stdout.

Here's the important part of the output. I've annotated it with some comments indicating what's going on in the assembly language:

; x is in edi to begin with.  eax will be used as a temporary register. mov  eax, edi  ; temp = x imul eax, edi  ; temp = x * temp imul eax, edi  ; temp = x * temp imul eax, eax  ; temp = temp * temp 

I'm using system GCC on Linux Mint 16 Petra, an Ubuntu derivative. Here's the gcc version:

$ gcc --version gcc (Ubuntu/Linaro 4.8.1-10ubuntu9) 4.8.1 

As other posters have noted, this option is not possible in floating point, because floating point arithmetic is not associative.

Top 3 video Explaining assembly - Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?







Related QUESTION?