Unexpected performance with global variables
I am getting a strange result using global variables. This question was inspired by another question. In the code below if I change
int ncols = 4096;
to
static int ncols = 4096;
or
const int ncols = 4096;
the code runs much faster and the assembly is much simpler.
//c99 -O3 -Wall -fopenmp foo.c
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
int nrows = 4096;
int ncols = 4096;
//static int ncols = 4096;
char* buff;
void func(char* pbuff, int * _nrows, int * _ncols) {
    for (int i=0; i<*_nrows; i++) {
        for (int j=0; j<*_ncols; j++) {
            *pbuff += 1;
            pbuff++;
        }
    }
}
int main(void) {
    buff = calloc(ncols*nrows, sizeof*buff);
    double dtime = -omp_get_wtime();
    for(int k=0; k<100; k++) func(buff, &nrows, &ncols);
    dtime += omp_get_wtime();
    printf("time %.16e\n", dtime/100);
    return 0;
}
I also get the same result if char* buff is an automatic variable (i.e. not global or static). I mean:
//c99 -O3 -Wall -fopenmp foo.c
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
int nrows = 4096;
int ncols = 4096;
void func(char* pbuff, int * _nrows, int * _ncols) {
    for (int i=0; i<*_nrows; i++) {
        for (int j=0; j<*_ncols; j++) {
            *pbuff += 1;
            pbuff++;
        }
    }
}
int main(void) {
    char* buff = calloc(ncols*nrows, sizeof*buff);
    double dtime = -omp_get_wtime();
    for(int k=0; k<100; k++) func(buff, &nrows, &ncols);
    dtime += omp_get_wtime();
    printf("time %.16e\n", dtime/100);
    return 0;
}
If I change buff to be a short pointer then the performance is fast and does not depend on whether ncols is static or constant, or on whether buff is automatic. However, when I make buff an int* pointer I observe the same effect as with char*.
I thought this may be due to pointer aliasing so I also tried
void func(int * restrict pbuff, int * restrict _nrows, int * restrict _ncols)
but it made no difference.
Here are my questions:
1. When buff is either a char* pointer or an int* global pointer, why is the code faster when ncols has file scope or is constant?
2. Why does buff being an automatic variable instead of global or static make the code faster?
3. Why does it make no difference when buff is a short pointer?
4. If this is due to pointer aliasing, why does restrict have no noticeable effect?
Note that I'm using omp_get_wtime() simply because it's convenient for timing.
As written, some elements of the code let GCC make different assumptions when optimizing; the optimization with the most impact here is most likely loop vectorization. Therefore:
Why is the code faster?
The code is faster because its hot part, the loops in func, has been optimized with auto-vectorization. Indeed, when ncols is qualified with static or const, GCC emits:
note: loop vectorized
note: loop peeled for vectorization to enhance alignment
which is visible if you turn on -fopt-info-loop, -fopt-info-vec, or combinations of those with a further -optimized, since it has the same effect.
In this case, GCC is able to compute the number of iterations, which is intuitively necessary to apply vectorization.
As for buff being automatic rather than global: this is again due to the storage of buff, which is external if not specified otherwise. With a global buff the whole vectorization is immediately skipped, unlike when buff is local, where it carries on and succeeds.
As for buff being a short pointer: why should it make a difference? func accepts a char*, which may alias anything.
As for restrict: I don't think it matters, because GCC can see that the pointers don't alias when func is invoked, so restrict isn't needed.
A const variable will most likely always yield code that is faster than, or as fast as, code using a read/write variable, since the compiler knows that the variable won't be changed, which in turn enables a whole range of optimizations.
Declaring a file scope variable int or static int should not affect performance much, as it will still be allocated at the very same place: the .data section.
But as mentioned in comments, if the variable is global, the compiler might have to assume that some other file (translation unit) might modify it and therefore block some optimization. I suppose this is what's happening.
But this shouldn't be any concern anyhow, since there is never a reason to declare a global variable in C, period. Always declare them as static to prevent the variable from getting abused for spaghetti-coding purposes.
In general I'd also question your benchmarking results. On Windows you should be using QueryPerformanceCounter and similar. https://msdn.microsoft.com/en-us/library/windows/desktop/dn553408%28v=vs.85%29.aspx