Why is malloc 7x slower than new for Intel's icc?
I benchmarked malloc vs. new for allocating an array of floats. My understanding was that the operations performed by malloc are a subset of the operations performed by new -- that malloc just allocates but new allocates and constructs, although I'm not sure if that's a meaningful difference for primitives.
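Here is a minimal sketch of the two calls as I understand them (the surrounding main and the size constant here are mine, not taken from the benchmark itself):

#include <cstdlib>   // std::malloc, std::free

int main() {
    const std::size_t size = 4000;

    // malloc: hands back raw storage, nothing is constructed.
    float* a = static_cast<float*>(std::malloc(sizeof(float) * size));

    // new[]: allocates storage and default-initializes the elements; for a
    // primitive type like float, default-initialization does nothing, so the
    // extra work over malloc should be negligible.
    float* b = new float[size];

    std::free(a);
    delete[] b;
    return 0;
}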
Benchmarking results with gcc give the expected behavior. malloc() is faster. There are even SO questions asking the opposite of this one.
With icc, malloc can be 7x slower than new. How is this possible?!
Everything that follows is just details of the benchmarking procedure.
For benchmarking, I used a protocol recently described by Intel. Here are my results.
Clock cycles elapsed while allocating an array of 4000 floats with GNU's gcc:
new memory allocation, cycles 12168
malloc allocation, cycles 5144
And with Intel's icc:
new memory allocation clock cycles 7251
malloc memory allocation clock cycles 52372
How I'm using malloc:
volatile float* numbers = (float*)malloc(sizeof(float)*size);
How I'm using new:
volatile float* numbers = new float[size];
The volatile is there because in previous benchmarking attempts I've had issues with wily compilers optimizing away entire function calls and generating programs that just store constants. (The function implementations that the compiler chose to optimize in this way were indeed faster than the ones that it didn't!) I tried with the volatile removed just to be sure and the results were the same.
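As an aside, an alternative to volatile that I've seen suggested (not what the numbers here use) is an empty GCC/ICC extended-asm statement that consumes the pointer and clobbers memory, acting as an optimization barrier:

// Hypothetical alternative to the volatile qualifier, assuming `size` is in
// scope and <cstdlib> is included: the empty asm keeps the compiler from
// proving the allocation unused and deleting the call.
float* numbers = static_cast<float*>(std::malloc(sizeof(float) * size));
asm volatile("" : : "g"(numbers) : "memory");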
I sandwich the portion of code I want to benchmark between two macros.
Macro that comes before a function:
#define CYCLE_COUNT_START \
asm volatile ("CPUID\n\t" \
"RDTSC\n\t" \
"mov %%edx, %0\n\t" \
"mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low):: \
"%rax", "%rbx", "%rcx", "%rdx");
Macro that comes after a function:
#define CYCLE_COUNT_END \
asm volatile("RDTSCP\n\t" \
"mov %%edx, %0\n\t" \
"mov %%eax, %1\n\t" \
"CPUID\n\t": "=r" (cycles_high1), "=r" (cycles_low1):: \
"%rax", "%rbx", "%rcx", "%rdx"); \
start = ( ((uint64_t)cycles_high << 32) | cycles_low ); \
end = ( ((uint64_t)cycles_high1 << 32) | cycles_low1 ); \
ellapsed_cycles = end - start;
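These macros assume a handful of variables already declared in the enclosing scope; my assumption, following the Intel paper, is roughly:

#include <cstdint>

// Assumed declarations used by the macros above (my reading of the Intel paper).
uint32_t cycles_high, cycles_low;      // filled in by CYCLE_COUNT_START
uint32_t cycles_high1, cycles_low1;    // filled in by CYCLE_COUNT_END
uint64_t start, end, ellapsed_cycles;  // assembled 64-bit TSC values and their difference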
So the allocation call with the sandwiching macros for new is this:
CYCLE_COUNT_START
volatile float* numbers = new float[size];
CYCLE_COUNT_END
After that I check the value of ellapsed_cycles to see how everything went.
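The value is printed to stdout; the exact wording below is my reconstruction, but it has the shape that the bash loop further down parses with cut:

// Assumed reporting line (reconstruction), assuming <iostream> is included:
std::cout << "new memory allocation clock cycles " << ellapsed_cycles << std::endl;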
And just to be sure I'm not doing something silly, here's how I'm compiling with icc:
icc -O3 -ipo -no-prec-div -std=c++11 heap_version3.cpp -o heap_version3
icc -O3 -ipo -no-prec-div -std=c++11 malloc_heap_version3.cpp -o malloc_heap_version3
And with gcc:
g++-4.8 -Ofast -march=native -std=c++11 heap_version3.cpp -o heap_version3
g++-4.8 -Ofast -march=native -std=c++11 malloc_heap_version3.cpp -o malloc_heap_version3
This is on a 2012 MacBook Pro with corei7-avx instructions available. I have swapped the 'as' binary for a script matching the one described here so that gcc can emit AVX instructions.
EDIT 1
For those asking to see more loop iterations: please skim the Intel link before posting. That said, I would probably have had the same reaction, so here are those loop iterations.
The array size is still 4000 and each run of the program still performs a single allocation. I didn't want to change what was being benchmarked by allocating a larger array that doesn't fit in L1, or by repeatedly allocating and deallocating memory and thereby raising other questions about the memory system. Instead, the program is run in a loop by bash. There are 4 separate benchmark programs, and all 4 are run in every loop iteration to reduce heterogeneity due to other running processes.
for i in $(seq 1 10000); do
echo gcc malloc $(./gcc_malloc_heap_version3 | head -n 1 | cut -d" " -f 4-)
echo icc malloc $(./icc_malloc_heap_version3 | head -n 1 | cut -d" " -f 4-)
echo gcc new $(./gcc_heap_version3 | head -n 1 | cut -d" " -f 4-)
echo icc new $(./icc_heap_version3 | head -n 1 | cut -d" " -f 4-)
done
icc memory allocation timings:
          malloc      new
Min.   :    3093     1150
1st Qu.:    3729     1367
Median :    3891     1496
Mean   :    4015     1571
3rd Qu.:    4099     1636
Max.   :   33231   183377
Welch Two Sample t-test
p-value < 2.2e-16
The observed difference is unlikely to have occurred by chance.
Density estimates for compilers and allocation methods:
The difference is now less dramatic, but the ordering for icc is still the reverse of what I expected.
EDIT 2
The results are nearly identical for an array of char. Because sizeof(float) is 4 and sizeof(char) is 1, I increased the array length to 16,000 so that the total allocation size stays the same.
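The char variant is the same code with the type swapped; roughly (my sketch):

// Sketch of the char variant: 16,000 chars keeps the total allocation at
// 16,000 bytes, the same as 4,000 floats.
const std::size_t size = 16000;
volatile char* numbers_malloc = (char*)malloc(sizeof(char) * size);
volatile char* numbers_new    = new char[size];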
EDIT 3
source code and scripts
EDIT 4
Same data replotted as a timecourse for the first 100 allocations.
It doesn't work like that. Processors and operating systems are complicated. You can't time a single call that takes a few microseconds and expect to get meaningful timing information. For starters, another app could use your CPU for a bit, and RDTSC keeps counting regardless.
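One common mitigation, which is not part of the original post (and which the question deliberately avoids because it changes what is being measured), is to repeat the measurement inside the process and keep the minimum, something like:

#include <algorithm>
#include <cstdint>

// Hypothetical sketch, reusing the CYCLE_COUNT_* macros and the variables they
// assume from the question: the minimum over many runs is far less sensitive
// to interrupts and scheduling noise than a single-shot reading.
uint64_t best = UINT64_MAX;
for (int i = 0; i < 1000; ++i) {
    CYCLE_COUNT_START
    volatile float* numbers = new float[4000];
    CYCLE_COUNT_END
    delete[] const_cast<float*>(numbers);   // keep the free out of the timed region
    best = std::min(best, ellapsed_cycles);
}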