What's the relative speed of floating point add vs. floating point multiply

A decade or two ago, it was worthwhile to write numerical code to avoid using multiplies and divides and use addition and subtraction instead. A good example is using forward differences to evaluate a polynomial curve instead of computing the polynomial directly.

Is this still the case, or have modern computer architectures advanced to the point where *,/ are no longer many times slower than +,- ?

To be specific, I'm interested in compiled C/C++ code running on modern typical x86 chips with extensive on-board floating point hardware, not a small micro trying to do FP in software. I realize pipelining and other architectural enhancements preclude specific cycle counts, but I'd still like to get a useful intuition.


It also depends on instruction mix. Your processor will have several computation units standing by at any time, and you'll get maximum throughput if all of them are filled all the time. So, executing a loop of mul's is just as fast as executing a loop or adds - but the same doesn't hold if the expression becomes more complex.

For example, take this loop:

for(int j=0;j<NUMITER;j++) {
  for(int i=1;i<NUMEL;i++) {
    bla += 2.1 + arr1[i] + arr2[i] + arr3[i] + arr4[i] ;
  }
}

for NUMITER=10^7, NUMEL=10^2, both arrays initialized to small positive numbers (NaN is much slower), this takes 6.0 seconds using doubles on a 64-bit proc. If I replace the loop with

bla += 2.1 * arr1[i] + arr2[i] + arr3[i] * arr4[i] ;

It only takes 1.7 seconds... so since we "overdid" the additions, the muls were essentially free; and the reduction in additions helped. It get's more confusing:

bla += 2.1 + arr1[i] * arr2[i] + arr3[i] * arr4[i] ;

-- same mul/add distribution, but now the constant is added in rather than multiplied in -- takes 3.7 seconds. Your processor is likely optimized to perform typical numerical computations more efficiently; so dot-product like sums of muls and scaled sums are about as good as it gets; adding constants isn't nearly as common, so that's slower...

bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; /*someval == 2.1*/

again takes 1.7 seconds.

bla += someval + arr1[i] + arr2[i] + arr3[i] + arr4[i] ; /*someval == 2.1*/

(same as initial loop, but without expensive constant addition: 2.1 seconds)

bla += someval * arr1[i] * arr2[i] * arr3[i] * arr4[i] ; /*someval == 2.1*/

(mostly muls, but one addition:1.9 seconds)

So, basically; it's hard to say which is faster, but if you wish to avoid bottlenecks, more important is to have a sane mix, avoid NaN or INF, avoid adding constants. Whatever you do, make sure you test, and test various compiler settings, since often small changes can just make the difference.

Some more cases:

bla *= someval; // someval very near 1.0; takes 2.1 seconds
bla *= arr1[i] ;// arr1[i] all very near 1.0; takes 66(!) seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; // 1.6 seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; //32-bit mode, 2.2 seconds
bla += someval + arr1[i] * arr2[i] + arr3[i] * arr4[i] ; //32-bit mode, floats 2.2 seconds
bla += someval * arr1[i]* arr2[i];// 0.9 in x64, 1.6 in x86
bla += someval * arr1[i];// 0.55 in x64, 0.8 in x86
bla += arr1[i] * arr2[i];// 0.8 in x64, 0.8 in x86, 0.95 in CLR+x64, 0.8 in CLR+x86

In theory the information is here:

Intel®64 and IA-32 Architectures Optimization Reference Manual, APPENDIX C INSTRUCTION LATENCY AND THROUGHPUT

For every processor they list, the latency on FMUL is very close to that of FADD or FDIV. On some of the older processors, FDIV is 2-3 time slower than that, while on newer processors, it's the same as FMUL.

Caveats:

  • The document I linked actually says you can't rely on these numbers in real life since the processor will do what it wishes to make things faster if it's correct.

  • There's a good chance your compiler will decide to use one of the many newer instruction sets that have a floating-point multiply / divide available.

  • This is a complicated document only meant to be read by compiler writers and I might have gotten it wrong. Like I'm not clear why the FDIV latency number is completely missing for some of the CPUs.


  • The best way to answer this question is to actually write a benchmark/profile of the processing you need to do. Empirical should be used over theoretical when ever possible. Especially when it easy to attain.

    If you already know different implementations of the Math you need to do, you could write aa few different code transfermations of the math and see where your performance peaks. This will allow the processor/compiler to generate different execution streams to fill the processor pipelines and give you a concrete answer to your answer.

    If you are interest in specifically the performance of DIV/MUL/ADD/SUB type instructions you could even toss in some inline assembly to control specifically which variants of these instruction are executed. However you need to make sure you're keeping multilple execution units busy to get a good idea of the performance the system is capable of.

    Also doing something like this would allow you to compare performance on multiple variations of the processor by simply running the same program on them, and could also allow you to factor in the motherboard differences.

    Edit:

    Basic architecture of a +- is identical. So they logically take the same time to compute. * on the other hand, require multiple layers, typically constructed out of "full adders" to complete a single operation. This garentees that while a * can be issued to the pipeline every cycle it will have a higher latency than an add/subtract circuit. A fp / operation is typically implemented using an approximation method which iteratively converges towards the correct answer over time. These types of approximations are typically implemented via multiplication. So for floating point you can generally assume division will take longer because it's impractical to "unroll" the multiplications (which is already a large circuit in and of it's self) into pipeline of a multitude of multiplier circuits. Still the performance of a given system is best measured via testing.

    链接地址: http://www.djcxy.com/p/85648.html

    上一篇: 拉夫森分部为浮点分裂?

    下一篇: 浮点加法与浮点乘法的相对速度是多少?