1,000微妙计算每微秒？

2018-06-04 15:20:13

好的，我一直在和一位朋友谈论编译器和程序优化，他建议n * 0.5比n / 2更快。我说编译器会自动进行这种优化，所以我编写了一个小程序来查看n / 2和n * 0.5之间是否有差别：

师：

#include <stdio.h>
#include <time.h>

int main(int argc, const char * argv[]) {
    int i, m;
    float n, s;
    clock_t t;

    m = 1000000000;
    t = clock();
    for(i = 0; i < m; i++) {
        n = i / 2;
    }
    s = (float)(clock() - t) / CLOCKS_PER_SEC;

    printf("n = i / 2: %d calculations took %f seconds (last calculation = %f)n", m, s, n);

    return 0;
}

乘法：

#include <stdio.h>
#include <time.h>

int main(int argc, const char * argv[]) {
    int i, m;
    float n, s;
    clock_t t;

    m = 1000000000;
    t = clock();
    for(i = 0; i < m; i++) {
        n = i * 0.5;
    }
    s = (float)(clock() - t) / CLOCKS_PER_SEC;

    printf("n = i * 0.5: %d calculations took %f seconds (last calculation = %f)n", m, s, n);

    return 0;
}

对于这两个版本，我平均得到0.000002s。编译时使用clang main.c -O1 。他说时间测量一定有什么问题。于是他写了一个程序：

#include <cstdio>
#include <iostream>
#include <ctime>

using namespace std;

int main()
{
    clock_t ts, te;
    double  dT;

    int i, m;
    double n, o, p, q, r, s;
    m = 1000000000;

    cout << "Independent calculations:n";
    ts = clock();
    for (i = 0; i < m; i++)
    {
        //  make it a trivial pure float calculation with no int casting to float
        n = 11.1 / 2.3;
        o = 22.2 / 2.3;
        p = 33.3 / 2.3;
        q = 44.4 / 2.3;
        r = 55.5 / 2.3;
        s = 66.6 / 2.3;
    }

    te = clock();
    dT = ((float)(te - ts)) / CLOCKS_PER_SEC;   //  make initial call to get the elapsed time to run the loop
    ts = clock();

    printf("Division: %d calculations took %f secondsn", m, dT);

    for (i = 0; i < m; i++)
    {
        //  make it a trivial pure float calculation with no int casting to float
        n = 11.1 * 0.53;
        o = 22.2 * 0.53;
        p = 33.3 * 0.53;
        q = 44.4 * 0.53;
        r = 55.5 * 0.53;
        s = 66.6 * 0.53;
    }

    te = clock();
    dT = ((float)(te - ts)) / CLOCKS_PER_SEC;   //  make initial call to get the elapsed time to run the loop
    ts = clock();

    printf("Multiplication: %d calculations took %f secondsn", m, dT);

    cout << "nDependent calculations:n";
    for (i = 0; i < m; i++)
    {
        //  make it a trivial pure float calculation with no int casting to float
        n = 11.1 / 2.3;
        o = n / 2.3;
        p = o / 2.3;
        q = p / 2.3;
        r = q / 2.3;
        s = r / 2.3;
    }


    te = clock();
    dT = ((float)(te - ts)) / CLOCKS_PER_SEC;   //  make initial call to get the elapsed time to run the loop
    ts = clock();

    printf("Division: %d calculations took %f secondsn", m, dT);

    for (i = 0; i < m; i++)
    {
        //  make it a trivial pure float calculation with no int casting to float
        n = 11.1 * 0.53;
        o = n * 0.53;
        p = o * 0.53;
        q = p * 0.53;
        r = q * 0.53;
        s = r * 0.53;
    }


    te = clock();
    dT = ((float)(te - ts)) / CLOCKS_PER_SEC;   //  make initial call to get the elapsed time to run the loop
    ts = clock();

    printf("Multiplication: %d calculations took %f secondsn", m, dT);

    return 0;
}

为此他得到了......

...以该顺序。

所以我在我的机器上运行该程序，并使用clang++ main.cpp -O1 0.000002 to 0.000011编译，并得到与以前相似的结果： 0.000002 to 0.000011 。

然而，当我没有优化编译这个程序时，我在他的第一次测试中得到了类似的结果。所以我的问题是，任何数量的优化如何使程序更快？

由于代码在循环的每次迭代中都没有做任何不同的事情，因此编译器可以自由地将循环内的代码移到外部（结果完全相同），并完全删除循环，使您几乎为0如你所见，运行时间。

这是一个很好的例子，说明如何使用高级语言进行基准测试比基准测试程序更难（这已经很难达到正确的水平）。编译器可以并经常会干扰您的基准测试。

你的朋友有一个观点，一个部门（实际部门，不只是写/在C）比一个乘法慢。对于双打，延迟的比例大约是4，分割不流水线，而乘法是，所以吞吐率更差：大约20（这些数字是Haswell，但是是典型的）

整数除法比浮点除法要慢，但在整数上使用浮点乘法会导致两次转换。转换并不算太糟，所以总的来说，浮点乘法运算速度仍然较快。

但是任何合适的编译器都会将整数除以一个常量，然后转换为整数乘法和一个移位，也许会有一些额外的修正（取决于除数和类型）。用两个幂来划分更简单。有关详细信息，请参阅使用乘法的不变整数除法。作为一个例子，考虑一下

int div2(int i)
{
    return i / 2;
}

GCC把它变成

mov eax, edi
shr eax, 31
add eax, edi
sar eax
ret

取决于μarch，需要3或4个周期（不包括控制流量）。

另一方面，

int div2(int i)
{
    return i * 0.5;
}

变成了

    cvtsi2sd    xmm0, edi
    mulsd   xmm0, QWORD PTR .LC0[rip]
    cvttsd2si   eax, xmm0
    ret
.LC0:
    .long   0
    .long   1071644672

这将需要大约4 + 5 + 4 = 13个周期。

即使没有任何“不安全的优化”，编译器也被允许将f / 2.0转换为f * 0.5 ，除以2的幂等于乘以其倒数。这不适用于不是两次幂的数字。

所以即使有了一个基准，至少可以测量一些东西，但它可能不会衡量正确的事情。为了测量延迟浮点除法与乘法，你可以写下如下内容：

.section data
align 16
one: dq 1.0, 1.0
.section text
_bench1:
  mov ecx, -10000000
  movapd xmm0, [one]
loopone:
  mulpd xmm0, xmm0
  mulpd xmm0, xmm0
  add ecx, 1
  jnz loopone
  ret
_bench2:
  mov ecx, -10000000
  movapd xmm0, [one]
looptwo:
  divpd xmm0, xmm0
  divpd xmm0, xmm0
  add ecx, 1
  jnz looptwo
  ret

打电话给千寻，包裹rdtsc来获得时间。以两者的最低时间。乘以你的基本时钟和turbo时钟之间的时间乘以时间。那么你应该有（大约）两个循环所需的循环次数，除以20000000以获得每个mulpd或divpd的循环次数。时间分割取决于被分割的值，所以它不会给出最一般的结果。

您也可能对指令延迟和吞吐量列表感兴趣。

for (i = 0; i < m; i++)
{
    //  make it a trivial pure float calculation with no int casting to float
    n = 11.1 * 0.53;
    o = n * 0.53;
    p = o * 0.53;
    q = p * 0.53;
    r = q * 0.53;
    s = r * 0.53;
}

是一个不引用i或m且不包含循环引用的循环，因此编译器除去循环语句是微不足道的

链接地址: http://www.djcxy.com/p/15025.html

上一篇: 1,000,000,000 calculations per microsecond?

下一篇: How raise to power works? Is it worth to use pow(x, 2)?