Why is this totaling operation faster on the stack than the heap?
In the C# program below, compiled with Visual Studio 2015 Update 2 in x64 Release mode and run on a Broadwell CPU under Windows 8.1, two variants of a benchmark are executed. Both do the same thing: total five million integers in an array.
The difference between the two benchmarks is that one version keeps the running total (a single long) on the stack, while the other keeps it on the heap. No allocation takes place in either version; a total is accumulated while scanning along an array.
In testing I see a consistent, significant performance difference between the variant with the total on the heap and the one with it on the stack. With some test sizes, the heap version is three times slower.
Why is there such a performance disparity between the two memory locations for the total?
using System;
using System.Diagnostics;

namespace StackHeap
{
    class StackvHeap
    {
        static void Main(string[] args)
        {
            double stackAvgms, heapAvgms;

            // Warmup
            runBenchmark(out stackAvgms, out heapAvgms);
            // Run
            runBenchmark(out stackAvgms, out heapAvgms);

            Console.WriteLine($"Stack avg: {stackAvgms} ms\nHeap avg: {heapAvgms} ms");
        }

        private static void runBenchmark(out double stackAvgms, out double heapAvgms)
        {
            Benchmarker b = new Benchmarker();

            long stackTotalms = 0;
            int trials = 100;
            for (int i = 0; i < trials; ++i)
            {
                stackTotalms += b.stackTotaler();
            }

            long heapTotalms = 0;
            for (int i = 0; i < trials; ++i)
            {
                heapTotalms += b.heapTotaler();
            }

            stackAvgms = stackTotalms / (double)trials;
            heapAvgms = heapTotalms / (double)trials;
        }
    }

    class Benchmarker
    {
        long heapTotal;
        int[] vals = new int[5000000];

        public long heapTotaler()
        {
            setup();

            var stopWatch = new Stopwatch();
            stopWatch.Start();
            for (int i = 0; i < vals.Length; ++i)
            {
                heapTotal += vals[i];
            }
            stopWatch.Stop();

            //Console.WriteLine($"{stopWatch.ElapsedMilliseconds} milliseconds with the counter on the heap");
            return stopWatch.ElapsedMilliseconds;
        }

        public long stackTotaler()
        {
            setup();

            var stopWatch = new Stopwatch();
            stopWatch.Start();
            long stackTotal = 0;
            for (int i = 0; i < vals.Length; ++i)
            {
                stackTotal += vals[i];
            }
            stopWatch.Stop();

            //Console.WriteLine($"{stopWatch.ElapsedMilliseconds} milliseconds with the counter on the stack");
            return stopWatch.ElapsedMilliseconds;
        }

        private void setup()
        {
            heapTotal = 0;
            for (int i = 0; i < vals.Length; ++i)
            {
                vals[i] = i;
            }
        }
    }
}
"With some test sizes it's three times slower"
That's the only clue to the underlying problem. If you care about performance for long variables, do not use the x86 jitter: alignment is critical, and you can't get a good enough alignment guarantee in 32-bit mode.
The CLR can then only align to 4, which gives such a test three distinct outcomes. The variable can be aligned to 8: the fast version. It can be misaligned to 4 within a cache line: about 2x slower. Or it can be misaligned to 4 and straddle an L1 cache-line boundary: about 3x slower. The same kind of problem occurs with double, by the way.
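The placement can be observed directly. Below is a hypothetical probe, not part of the original benchmark (it assumes a build with unsafe code enabled; `AlignmentProbe` and its output labels are mine), that prints where the CLR happened to put a long field relative to 8-byte and 64-byte boundaries:

```csharp
// Hypothetical alignment probe -- not from the original post. Compile
// with /unsafe. Under the x86 jitter the CLR only guarantees 4-byte
// alignment, so across runs the field can land 8-aligned (fast),
// 4-aligned inside a cache line (~2x slower), or 4-aligned straddling
// a 64-byte L1 line boundary (~3x slower).
using System;

class AlignmentProbe
{
    long heapTotal;  // the field whose placement we inspect

    static unsafe void Main()
    {
        var probe = new AlignmentProbe();
        fixed (long* p = &probe.heapTotal)   // pin the object, take the field's address
        {
            ulong addr = (ulong)p;
            Console.WriteLine($"8-aligned:         {addr % 8 == 0}");
            // An 8-byte value straddles a 64-byte line when it starts in
            // the last 7 bytes of a line; with 4-alignment that means an
            // offset of 60 within the line.
            Console.WriteLine($"straddles L1 line: {addr % 64 > 56}");
        }
    }
}
```

Running this repeatedly under the 32-bit jitter shows the field moving between the three cases from run to run, which is why the benchmark's timings cluster into distinct speed bands.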
Use Project > Properties > Build tab > untick the "Prefer 32-bit" checkbox. Just in case, also use Tools > Options > Debugging > General and untick "Suppress JIT optimization on module load". Tweak the benchmark code and put a for-loop around it; I always run it at least 10 times. Select the Release configuration and run the tests again.
You now have a completely different question, probably more in line with your expectations. The JIT can keep a local variable in a register, but it does not enregister an object's field across the loop. Having to update heapTotal in memory on every iteration is the overhead you see.
This is from the heapTotaler disassembly:
heapTotal = 0;
000007FE99F34966 xor ecx,ecx
000007FE99F34968 mov qword ptr [rsi+10h],rcx
for (int i = 0; i < vals.Length; ++i)
000007FE99F3496C mov rax,qword ptr [rsi+8]
000007FE99F34970 mov edx,dword ptr [rax+8]
000007FE99F34973 test edx,edx
000007FE99F34975 jle 000007FE99F34993
{
heapTotal += vals[i];
000007FE99F34977 mov r8,rax
000007FE99F3497A cmp ecx,edx
000007FE99F3497C jae 000007FE99F349C8
000007FE99F3497E movsxd r9,ecx
000007FE99F34981 mov r8d,dword ptr [r8+r9*4+10h]
000007FE99F34986 movsxd r8,r8d
000007FE99F34989 add qword ptr [rsi+10h],r8
You can see it's using [rsi+10h] for the heapTotal variable.
And this is from stackTotaler:
long stackTotal = 0;
000007FE99F3427A xor ecx,ecx
for (int i = 0; i < vals.Length; ++i)
000007FE99F3427C xor eax,eax
000007FE99F3427E mov rdx,qword ptr [rsi+8]
000007FE99F34282 mov r8d,dword ptr [rdx+8]
000007FE99F34286 test r8d,r8d
000007FE99F34289 jle 000007FE99F342A8
{
stackTotal += vals[i];
000007FE99F3428B mov r9,rdx
000007FE99F3428E cmp eax,r8d
000007FE99F34291 jae 000007FE99F342DD
000007FE99F34293 movsxd r10,eax
000007FE99F34296 mov r9d,dword ptr [r9+r10*4+10h]
000007FE99F3429B movsxd r9,r9d
000007FE99F3429E add rcx,r9
You can see the JIT has optimized the code: it keeps stackTotal in the RCX register. Registers are faster than memory access, hence the speed improvement.
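The same optimization can be done by hand in the heap version: accumulate into a local that the JIT can enregister, then store to the field once after the loop. A minimal sketch (the names mirror the question's code, but the `heapTotalerViaLocal` rewrite is mine, not the original poster's):

```csharp
// Sketch of the fix: sum into a local the JIT can keep in a register,
// then perform a single store to the heap field. This makes the heap
// version behave like stackTotaler regardless of how the field happens
// to be aligned.
using System;

class Benchmarker
{
    long heapTotal;
    int[] vals = new int[5000000];

    public long heapTotalerViaLocal()
    {
        for (int i = 0; i < vals.Length; ++i)
            vals[i] = i;                 // same setup as the question

        long local = 0;                  // enregistered by the JIT
        for (int i = 0; i < vals.Length; ++i)
            local += vals[i];

        heapTotal = local;               // one store instead of five million
        return heapTotal;
    }

    static void Main()
    {
        var b = new Benchmarker();
        // Sum of 0..4999999 is 4999999 * 5000000 / 2.
        Console.WriteLine(b.heapTotalerViaLocal()); // prints 12499997500000
    }
}
```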