Why is malloc 7x slower than new for Intel's icc?

I benchmarked malloc vs. new for allocating an array of floats. My understanding was that the operations performed by malloc are a subset of the operations performed by new -- that malloc just allocates but new allocates and constructs, although I'm not sure if that's a meaningful difference for primitives.

Benchmarking results with gcc give the expected behavior. malloc() is faster. There are even SO questions asking the opposite of this one.

With icc, malloc can be 7x slower than new. How is that possible?!

Everything that follows is just details of the benchmarking procedure.

For benchmarking, I used a protocol recently described by Intel. Here are my results.

Clock cycles elapsed while allocating an array of 4000 floats with GNU's gcc:

new memory allocation, cycles            12168
malloc allocation, cycles                 5144

And with Intel's icc:

new    memory allocation clock cycles     7251
malloc memory allocation clock cycles    52372

How I'm using malloc:

volatile float* numbers = (float*)malloc(sizeof(float)*size);

How I'm using new:

volatile float* numbers = new float[size];

The volatile is there because in previous benchmarking attempts I've had issues with wily compilers optimizing away entire function calls and generating programs that just store constants. (The function implementations that the compiler chose to optimize in this way were indeed faster than the ones that it didn't!) I tried with the volatile removed just to be sure and the results were the same.

I sandwich the portion of code I want to benchmark between two macros.

Macro that comes before a function:

#define CYCLE_COUNT_START \
asm volatile ("CPUID\n\t" \
"RDTSC\n\t" \
"mov %%edx, %0\n\t" \
"mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low):: \
"%rax", "%rbx", "%rcx", "%rdx");

Macro that comes after a function:

#define CYCLE_COUNT_END \
asm volatile("RDTSCP\n\t" \
"mov %%edx, %0\n\t" \
"mov %%eax, %1\n\t" \
"CPUID\n\t": "=r" (cycles_high1), "=r" (cycles_low1):: \
"%rax", "%rbx", "%rcx", "%rdx"); \
start = ( ((uint64_t)cycles_high << 32) | cycles_low ); \
end = ( ((uint64_t)cycles_high1 << 32) | cycles_low1 ); \
ellapsed_cycles = end - start;

So the allocation call with the sandwiching macros for new is this:

CYCLE_COUNT_START
volatile float* numbers = new float[size];
CYCLE_COUNT_END

After that I check the value of ellapsed_cycles to see how everything went.

And just to be sure I'm not doing something silly, here's how I'm compiling with icc:

icc -O3 -ipo -no-prec-div -std=c++11 heap_version3.cpp           -o heap_version3
icc -O3 -ipo -no-prec-div -std=c++11 malloc_heap_version3.cpp    -o malloc_heap_version3

And with gcc:

g++-4.8 -Ofast -march=native -std=c++11 heap_version3.cpp        -o heap_version3
g++-4.8 -Ofast -march=native -std=c++11 malloc_heap_version3.cpp -o malloc_heap_version3

This is on a 2012 MacBook Pro with the corei7-avx instructions available. I have the 'as' binary swapped with a script that matches the one here so that gcc can use AVX instructions.

EDIT 1

To those asking to see more loop iterations: please skim the Intel link first and then post. On the other hand, I would probably have the same reaction, so here are those loop iterations.

The array size is still 4000 and each run of the program still does a single allocation of memory. I didn't want to change what was being benchmarked by allocating a larger array that doesn't fit in L1, or by repeatedly allocating and deallocating memory, which would raise other questions about the allocator. The program is run in a loop by bash. I run 4 separate programs for the benchmark, all 4 in every loop iteration, in an effort to reduce heterogeneity due to other running processes.

for i in $(seq 1 10000); do
    echo gcc malloc $(./gcc_malloc_heap_version3 | head -n 1 | cut -d" " -f 4-)
    echo icc malloc $(./icc_malloc_heap_version3 | head -n 1 | cut -d" " -f 4-)
    echo gcc new $(./gcc_heap_version3 | head -n 1 | cut -d" " -f 4-)
    echo icc new $(./icc_heap_version3 | head -n 1 | cut -d" " -f 4-)
done

icc memory allocation timings:

       malloc       new
Min.   : 3093      1150
1st Qu.: 3729      1367
Median : 3891      1496
Mean   : 4015      1571
3rd Qu.: 4099      1636
Max.   :33231    183377

    Welch Two Sample t-test
    p-value < 2.2e-16

The observed difference is unlikely to have occurred by chance.

Density estimates for compilers and allocation methods:

[Figure: probability density estimates of clock cycles elapsed during memory allocation, by compiler and allocation method]

The difference is now less dramatic but the order for icc is still the reverse of the expected.

EDIT 2

The results are nearly identical for an array of char. Because sizeof(float) gives me 4 and sizeof(char) gives me 1, I increased the array length to 16,000.

EDIT 3

source code and scripts

EDIT 4

Same data replotted as a time course for the first 100 allocations. [Figure: time course of clock cycles per memory allocation]


It doesn't work like that. Processors and operating systems are complicated. You can't just time a single call that takes a few microseconds and expect meaningful timing information. For starters, another process can use your CPU for a while, and RDTSC will keep counting the whole time.
