Unexpected performance with global variables

2018-06-08 23:24:47

I am getting a strange result using global variables. This question was inspired by another question. In the code below, if I change

int ncols = 4096;

to either

static int ncols = 4096;

or

const int ncols = 4096;

the code runs much faster and the assembly is much simpler.

//c99 -O3 -Wall -fopenmp foo.c
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int nrows = 4096;
int ncols = 4096;
//static int ncols = 4096;
char* buff;

void func(char* pbuff, int * _nrows, int * _ncols) {
    for (int i=0; i<*_nrows; i++) {
        for (int j=0; j<*_ncols; j++) {
            *pbuff += 1;
            pbuff++;
        }
    }
}

int main(void) {
    buff = calloc(ncols*nrows, sizeof*buff);
    double dtime = -omp_get_wtime();
    for(int k=0; k<100; k++) func(buff, &nrows, &ncols);
    dtime += omp_get_wtime();
    printf("time %.16e\n", dtime/100);
    return 0;
}

I also get the same result if char* buff is an automatic variable (i.e. not global or static). I mean:

//c99 -O3 -Wall -fopenmp foo.c
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int nrows = 4096;
int ncols = 4096;

void func(char* pbuff, int * _nrows, int * _ncols) {
    for (int i=0; i<*_nrows; i++) {
        for (int j=0; j<*_ncols; j++) {
            *pbuff += 1;
            pbuff++;
        }
    }
}

int main(void) {
    char* buff = calloc(ncols*nrows, sizeof*buff);
    double dtime = -omp_get_wtime();
    for(int k=0; k<100; k++) func(buff, &nrows, &ncols);
    dtime += omp_get_wtime();
    printf("time %.16e\n", dtime/100);
    return 0;
}

If I change buff to be a short pointer then the performance is fast and does not depend on whether ncols is static or constant, or on whether buff is automatic. However, when I make buff an int* pointer I observe the same effect as with char*.

I thought this may be due to pointer aliasing so I also tried

void func(int * restrict pbuff, int * restrict _nrows, int * restrict _ncols)

but it made no difference.

Here are my questions

When buff is either a char* or an int* global pointer, why is the code faster when ncols has file scope or is constant?

Why does buff being an automatic variable instead of global or static make the code faster?

Why does it make no difference when buff is a short pointer?

If this is due to pointer aliasing why does restrict have no noticeable effect?

Note that I'm using omp_get_wtime() simply because it's convenient for timing.

As written, several elements of the code let GCC make different assumptions when optimizing; the optimization with the largest impact here is most likely loop vectorization. Therefore,

Why is the code faster?

The code is faster because its hot part, the loops in func, has been optimized with auto-vectorization. When ncols is qualified with static or const, GCC indeed reports:

note: loop vectorized
note: loop peeled for vectorization to enhance alignment

which is visible if you turn on -fopt-info-loop, -fopt-info-vec, or combinations of those with a further -optimized appended, since they have the same effect.

Why does buff being an automatic variable instead of global or static make the code faster?

When buff is global, GCC is not able to compute the number of iterations, which is intuitively necessary to apply vectorization. This is again due to the linkage of buff, which is external if not specified otherwise. In that case the whole vectorization is immediately skipped, unlike when buff is local, where it carries on and succeeds.
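To make the iteration-count point concrete, here is a minimal sketch, assuming a hypothetical variant of func named func_hoisted (not part of the question's code). It reads both bounds into locals before the loops, so no store through pbuff can change them and the trip counts are fixed on entry. I have not measured this; whether GCC then vectorizes it in a given build is something -fopt-info-vec would have to confirm.

```c
/* Hypothetical variant of func (the name func_hoisted is an assumption).
   The bounds are read once into locals, so stores through pbuff
   cannot change the loop trip counts. */
void func_hoisted(char *pbuff, const int *_nrows, const int *_ncols) {
    const int nrows = *_nrows;  /* read once, before any store */
    const int ncols = *_ncols;
    for (int i = 0; i < nrows; i++) {
        for (int j = 0; j < ncols; j++) {
            *pbuff += 1;        /* same update as the original func */
            pbuff++;
        }
    }
}
```

It is called exactly like the original, e.g. func_hoisted(buff, &nrows, &ncols), and performs the same increments as long as the buffer does not overlap the two bounds.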

Why does it make no difference when buff is a short pointer?

Why should it? func accepts a char* which may alias anything.
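As an illustration of why a char* store forces the compiler to be careful, here is a small sketch with hypothetical helpers (count_iterations and aliasing_demo are names I made up). It deliberately passes a char* that points into the loop bound itself, which C allows because character types may access the bytes of any object. It assumes int has no padding bits, which holds on common platforms.

```c
#include <string.h>

/* Hypothetical helper: counts loop iterations while storing through p.
   The bound *n has to be re-read on every iteration, because p may
   point into the bytes of *n. */
int count_iterations(char *p, int *n) {
    int count = 0;
    for (int j = 0; j < *n; j++) {
        p[j % (int)sizeof(int)] = 0;  /* may clear one byte of *n */
        count++;
    }
    return count;
}

/* Makes p alias n on purpose and returns the iteration count. */
int aliasing_demo(void) {
    int n;
    memset(&n, 0x7f, sizeof n);       /* every byte 0x7f: a large positive bound */
    return count_iterations((char *)&n, &n);
}
```

In aliasing_demo the loop clears one byte of n per iteration, so it stops after sizeof(int) iterations instead of running for the original bound. A compiler that assumed *n could not change inside the loop would get this wrong, which is why the bound read through a pointer cannot simply be treated as constant when the stores go through a char*.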

If this is due to pointer aliasing why does restrict have no noticeable effect?

I don't think it is, because GCC can see that the pointers don't alias where func is invoked, so restrict isn't needed.

A const will most likely always yield code that is faster than, or as fast as, the code for a read/write variable, since the compiler knows that the variable won't be changed, which in turn enables a whole lot of optimizations.

Declaring a file scope variable int or static int should not affect performance much, as it will still be allocated at the very same place: the .data section.

But as mentioned in comments, if the variable is global, the compiler might have to assume that some other file (translation unit) might modify it and therefore block some optimization. I suppose this is what's happening.

But this shouldn't be any concern anyhow, since there is never a reason to declare a global variable in C, period. Always declare them as static to prevent the variable from getting abused for spaghetti-coding purposes.

In general I'd also question your benchmarking results. On Windows you should be using QueryPerformanceCounter and similar. https://msdn.microsoft.com/en-us/library/windows/desktop/dn553408%28v=vs.85%29.aspx
