1,000,000,000 calculations per microsecond?

OK, I've been talking to a friend about compilers and the optimisation of programs, and he suggested that n * 0.5 is faster than n / 2. I said that compilers do that kind of optimisation automatically, so I wrote a small program to see if there was a difference between n / 2 and n * 0.5:

Division:

#include <stdio.h>
#include <time.h>

int main(int argc, const char * argv[]) {
    int i, m;
    float n, s;
    clock_t t;

    m = 1000000000;
    t = clock();
    for(i = 0; i < m; i++) {
        n = i / 2;
    }
    s = (float)(clock() - t) / CLOCKS_PER_SEC;

    printf("n = i / 2: %d calculations took %f seconds (last calculation = %f)\n", m, s, n);

    return 0;
}

Multiplication:

#include <stdio.h>
#include <time.h>

int main(int argc, const char * argv[]) {
    int i, m;
    float n, s;
    clock_t t;

    m = 1000000000;
    t = clock();
    for(i = 0; i < m; i++) {
        n = i * 0.5;
    }
    s = (float)(clock() - t) / CLOCKS_PER_SEC;

    printf("n = i * 0.5: %d calculations took %f seconds (last calculation = %f)\n", m, s, n);

    return 0;
}

And for both versions I got 0.000002 s on average when compiled with clang main.c -O1. He said there must be something wrong with the time measurement, so he then wrote a program:

#include <cstdio>
#include <iostream>
#include <ctime>

using namespace std;

int main()
{
    clock_t ts, te;
    double  dT;

    int i, m;
    double n, o, p, q, r, s;
    m = 1000000000;

    cout << "Independent calculations:\n";
    ts = clock();
    for (i = 0; i < m; i++)
    {
        //  make it a trivial pure float calculation with no int casting to float
        n = 11.1 / 2.3;
        o = 22.2 / 2.3;
        p = 33.3 / 2.3;
        q = 44.4 / 2.3;
        r = 55.5 / 2.3;
        s = 66.6 / 2.3;
    }

    te = clock();
    dT = ((float)(te - ts)) / CLOCKS_PER_SEC;   //  make initial call to get the elapsed time to run the loop
    ts = clock();

    printf("Division: %d calculations took %f seconds\n", m, dT);

    for (i = 0; i < m; i++)
    {
        //  make it a trivial pure float calculation with no int casting to float
        n = 11.1 * 0.53;
        o = 22.2 * 0.53;
        p = 33.3 * 0.53;
        q = 44.4 * 0.53;
        r = 55.5 * 0.53;
        s = 66.6 * 0.53;
    }

    te = clock();
    dT = ((float)(te - ts)) / CLOCKS_PER_SEC;   //  make initial call to get the elapsed time to run the loop
    ts = clock();

    printf("Multiplication: %d calculations took %f seconds\n", m, dT);

    cout << "\nDependent calculations:\n";
    for (i = 0; i < m; i++)
    {
        //  make it a trivial pure float calculation with no int casting to float
        n = 11.1 / 2.3;
        o = n / 2.3;
        p = o / 2.3;
        q = p / 2.3;
        r = q / 2.3;
        s = r / 2.3;
    }


    te = clock();
    dT = ((float)(te - ts)) / CLOCKS_PER_SEC;   //  make initial call to get the elapsed time to run the loop
    ts = clock();

    printf("Division: %d calculations took %f seconds\n", m, dT);

    for (i = 0; i < m; i++)
    {
        //  make it a trivial pure float calculation with no int casting to float
        n = 11.1 * 0.53;
        o = n * 0.53;
        p = o * 0.53;
        q = p * 0.53;
        r = q * 0.53;
        s = r * 0.53;
    }


    te = clock();
    dT = ((float)(te - ts)) / CLOCKS_PER_SEC;   //  make initial call to get the elapsed time to run the loop
    ts = clock();

    printf("Multiplication: %d calculations took %f seconds\n", m, dT);

    return 0;
}

And for that he got...

1.869570s
1.868254s
25.674016s
3.497555s

...in that order.

So I ran the program on my machine, compiled with clang++ main.cpp -O1, and got similar results to before: 0.000002 to 0.000011 seconds.

However, when I compiled the program without optimisation, I got similar results to him on his first test. So my question is, how can any amount of optimisation make the program that much faster?


Since the code does nothing different in any iteration of the loop, the compiler is free to move the code inside the loop outside of it (the result is exactly the same) and then delete the loop entirely, which gives you the near-zero running times you saw.
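To make that concrete, here is a sketch (not actual compiler output) of what the first "independent" loop in your friend's program effectively collapses to:

int main()
{
    double n, o, p, q, r, s;

    // After hoisting the loop-invariant body out and deleting the now-empty
    // loop, only these six assignments remain, and the compiler can fold
    // them into constants at compile time...
    n = 11.1 / 2.3;
    o = 22.2 / 2.3;
    p = 33.3 / 2.3;
    q = 44.4 / 2.3;
    r = 55.5 / 2.3;
    s = 66.6 / 2.3;
    // ...and since none of these values are ever read afterwards, even the
    // assignments themselves are removed as dead stores. What gets timed is
    // essentially nothing.

    return 0;
}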


This is a good example of how benchmarking a high level language is even harder than benchmarking assembly (which is hard enough to get right already). The compiler can, and often will, interfere with your benchmark.

Your friend has a point: a division (an actual division, not just writing / in C) is slower than a multiplication. For doubles the latency ratio is about 4, and division isn't pipelined whereas multiplication is, so the throughput ratio is much worse: around 20. (These numbers are for Haswell, but they are typical.)

Integer division is slower than floating point division, but using a floating point multiplication on an integer costs two conversions. The conversions aren't too bad, so overall the floating point multiplication is still faster.

But any proper compiler will turn integer division by a constant into integer multiplication and a shift, and maybe some extra fix-up stuff (depending on the divisor and the type). Division by a power of two is even simpler. For details, see Division by Invariant Integers using Multiplication. As an example, consider

int div2(int i)
{
    return i / 2;
}

GCC turns this into

mov eax, edi
shr eax, 31
add eax, edi
sar eax
ret

Which, depending on the µarch, would take 3 or 4 cycles (excluding control flow).

On the other hand,

int div2(int i)
{
    return i * 0.5;
}

is turned into

    cvtsi2sd    xmm0, edi
    mulsd   xmm0, QWORD PTR .LC0[rip]
    cvttsd2si   eax, xmm0
    ret
.LC0:
    .long   0
    .long   1071644672

Which would take about 4+5+4 = 13 cycles.

A compiler is also allowed to turn f / 2.0 into f * 0.5 even without any "unsafe optimizations": division by a power of two is exactly equivalent to multiplication by its reciprocal, so the result is identical bit for bit. That does not hold for divisors that are not powers of two.
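For example, on a typical compiler both of the functions below should compile to the same multiply, with no fast-math flags needed (the function names are just for illustration):

// Sketch: dividing by a power of two is exact, so a compiler may replace the
// division with a multiplication by the reciprocal even in strict FP mode.
double half_div(double f) { return f / 2.0; }
double half_mul(double f) { return f * 0.5; }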

So even with a benchmark that at least measured something, it probably wouldn't have measured the right thing. In order to measure the latency of floating point division vs multiplication, you could write something like:

section .data
align 16
one: dq 1.0, 1.0

section .text
global _bench1, _bench2

_bench1:                      ; 2 x 10,000,000 dependent multiplications
  mov ecx, -10000000
  movapd xmm0, [one]
loopone:
  mulpd xmm0, xmm0
  mulpd xmm0, xmm0
  add ecx, 1
  jnz loopone
  ret

_bench2:                      ; 2 x 10,000,000 dependent divisions
  mov ecx, -10000000
  movapd xmm0, [one]
looptwo:
  divpd xmm0, xmm0
  divpd xmm0, xmm0
  add ecx, 1
  jnz looptwo
  ret

Call both a thousand times, wrapped in rdtsc to get the time, and take the lowest time for each. Multiply that time by the ratio of your turbo clock to your base clock (rdtsc counts reference cycles, not core cycles). You should then have (approximately) the number of cycles each loop takes; divide by 20000000 to get the number of cycles per mulpd or divpd. The time a division takes depends on the values being divided, so this doesn't give the most general result.
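To put that recipe into code, here is a minimal C++ driver sketch. It assumes GCC or clang on x86, and that the routines above are assembled and linked so they are reachable from C++ as bench1 and bench2 (the leading underscore in the listing matches the default C name mangling on some platforms; adjust it to match yours). The best_of helper exists only for this sketch.

// Times the two assembly loops above with rdtsc and reports the best run.
#include <cstdio>
#include <cstdint>
#include <x86intrin.h>        // __rdtsc (GCC/clang)

extern "C" void bench1();     // mulpd loop
extern "C" void bench2();     // divpd loop

static uint64_t best_of(void (*fn)(), int runs = 1000)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = __rdtsc();
        fn();
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best) best = t1 - t0;
    }
    return best;
}

int main()
{
    // rdtsc counts reference cycles; scale by turbo/base to get core cycles,
    // then divide by the 20,000,000 operations each call performs.
    printf("mulpd: %.2f reference cycles per op\n", best_of(bench1) / 20000000.0);
    printf("divpd: %.2f reference cycles per op\n", best_of(bench2) / 20000000.0);
    return 0;
}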

You may also be interested in a list of instruction latencies and throughputs.


for (i = 0; i < m; i++)
{
    //  make it a trivial pure float calculation with no int casting to float
    n = 11.1 * 0.53;
    o = n * 0.53;
    p = o * 0.53;
    q = p * 0.53;
    r = q * 0.53;
    s = r * 0.53;
}

is a loop whose body does not reference i and carries nothing from one iteration to the next, so it is trivial for the compiler to remove the loop entirely.
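If you want such a loop to survive for benchmarking, one common trick (shown here only as a sketch; the names sink and acc are illustrative) is to make each iteration depend on the previous one and to make the final result observable:

#include <cstdio>

volatile double sink;              // volatile: the final store cannot be optimised away

int main()
{
    const int m = 1000000000;
    double acc = 11.1;
    for (int i = 0; i < m; i++)
        acc = acc * 0.53 + 1.0;    // loop-carried dependency: iterations cannot be collapsed
    sink = acc;                    // the result is used, so the loop is not dead code
    printf("%f\n", sink);
    return 0;
}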
