Why is this totaling operation faster on the stack than the heap?

In the C# program below, compiled with Visual Studio 2015 Update 2 in x64 Release mode and run on a Broadwell CPU under Windows 8.1, two variants of a benchmark are run. Both do the same thing: total five million integers in an array.

The difference between the two variants is where the running total (a single long) lives: one keeps it on the stack as a local variable, the other on the heap as an instance field. No allocation takes place in either version; the loop simply adds each element of the array to the total while scanning along it.

In testing I'm seeing a consistent, significant performance difference between the two variants. With some test sizes the version with the total on the heap is three times slower.

Why is there such a performance disparity between the two memory locations for the total?

using System;
using System.Diagnostics;

namespace StackHeap
{
    class StackvHeap
    {
        static void Main(string[] args)
        {
            double stackAvgms, heapAvgms;

            // Warmup
            runBenchmark(out stackAvgms, out heapAvgms);

            // Run
            runBenchmark(out stackAvgms, out heapAvgms);

            Console.WriteLine($"Stack avg: {stackAvgms} msnHeap avg: {heapAvgms} ms");
        }

        private static void runBenchmark(out double stackAvgms, out double heapAvgms)
        {
            Benchmarker b = new Benchmarker();
            long stackTotalms = 0;
            int trials = 100;
            for (int i = 0; i < trials; ++i)
            {
                stackTotalms += b.stackTotaler();
            }
            long heapTotalms = 0;
            for (int i = 0; i < trials; ++i)
            {
                heapTotalms += b.heapTotaler();
            }

            stackAvgms = stackTotalms / (double)trials;
            heapAvgms = heapTotalms / (double)trials;
        }
    }

    class Benchmarker
    {
        long heapTotal;
        int[] vals = new int[5000000];

        public long heapTotaler()
        {
            setup();
            var stopWatch = new Stopwatch();
            stopWatch.Start();

            for (int i = 0; i < vals.Length; ++i)
            {
                heapTotal += vals[i];
            }
            stopWatch.Stop();
            //Console.WriteLine($"{stopWatch.ElapsedMilliseconds} milliseconds with the counter on the heap");
            return stopWatch.ElapsedMilliseconds;
        }

        public long stackTotaler()
        {
            setup();
            var stopWatch = new Stopwatch();
            stopWatch.Start();

            long stackTotal = 0;
            for (int i = 0; i < vals.Length; ++i)
            {
                stackTotal += vals[i];
            }
            stopWatch.Stop();
            //Console.WriteLine($"{stopWatch.ElapsedMilliseconds} milliseconds with the counter on the stack");
            return stopWatch.ElapsedMilliseconds;
        }

        private void setup()
        {
            heapTotal = 0;
            for (int i = 0; i < vals.Length; ++i)
            {
                vals[i] = i;
            }
        }
    }
}

"With some test sizes it's three times slower"

That's the real clue to the underlying problem. If you care about the performance of long variables then do not use the x86 jitter. Alignment is critical, and you can't get a good enough alignment guarantee in 32-bit mode.

The CLR can then only guarantee 4-byte alignment, which gives such a test three distinct outcomes: the variable can be aligned to 8 (the fast case), misaligned to 4 within a cache line (about 2x slower), or misaligned to 4 and straddling an L1 cache-line boundary (about 3x slower). The same kind of problem exists for double, by the way.
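
To make those three outcomes concrete, here is a minimal sketch (mine, not part of the original question or answer) that pins a byte buffer and places the running total at a chosen offset from a 64-byte boundary. It assumes "Allow unsafe code" is enabled in the project's Build settings; offsets 0, 4 and 60 correspond to the aligned, misaligned-within-a-line and line-straddling cases described above.

using System;
using System.Diagnostics;

class AlignmentSketch
{
    // Times the same add loop with the long total placed at a chosen byte
    // offset from a 64-byte (cache line) boundary inside a pinned buffer.
    static unsafe long TimeTotalAtOffset(int offset, int[] vals)
    {
        byte[] buffer = new byte[192];
        fixed (byte* p = buffer)
        {
            // Round the pinned pointer up to a 64-byte boundary so the
            // requested offset is meaningful.
            byte* alignedBase = (byte*)(((ulong)p + 63) & ~63UL);
            long* total = (long*)(alignedBase + offset);
            *total = 0;

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < vals.Length; ++i)
            {
                *total += vals[i];   // read-modify-write through the pointer
            }
            sw.Stop();
            return sw.ElapsedMilliseconds;
        }
    }

    static void Main()
    {
        int[] vals = new int[5000000];
        for (int i = 0; i < vals.Length; ++i) vals[i] = i;

        // 0 = aligned to 8, 4 = misaligned within a line, 60 = straddles a line boundary.
        foreach (int offset in new[] { 0, 4, 60 })
        {
            Console.WriteLine($"offset {offset}: {TimeTotalAtOffset(offset, vals)} ms");
        }
    }
}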

Go to Project > Properties > Build tab and untick the "Prefer 32-bit" checkbox. Just in case, also go to Tools > Options > Debugging > General and untick "Suppress JIT optimization on module load". Tweak the benchmark code as well: put a for-loop around it; I always run it at least 10 times. Select the Release configuration and run the tests again.
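
As a sketch of that tweak (reusing the names from the code above; the run count of 10 is just an example), Main could confirm the process is really running 64-bit and repeat the whole benchmark so a single noisy run doesn't mislead:

static void Main(string[] args)
{
    // Confirm the "Prefer 32-bit" change took effect before trusting any numbers.
    Console.WriteLine($"64-bit process: {Environment.Is64BitProcess}");

    double stackAvgms, heapAvgms;

    // Warmup
    runBenchmark(out stackAvgms, out heapAvgms);

    // Repeat the whole benchmark several times.
    for (int run = 0; run < 10; ++run)
    {
        runBenchmark(out stackAvgms, out heapAvgms);
        Console.WriteLine($"Run {run}: stack {stackAvgms} ms, heap {heapAvgms} ms");
    }
}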

You now have a completely different question, probably much closer to what you expected. The jitter can keep a local variable in a register because nothing else can observe it, but an update to an instance field like heapTotal has to be written back to memory on every iteration. That per-iteration store is the overhead you see.


This is from the heapTotaler disassembly:

            heapTotal = 0;
000007FE99F34966  xor         ecx,ecx  
000007FE99F34968  mov         qword ptr [rsi+10h],rcx  
            for (int i = 0; i < vals.Length; ++i)
000007FE99F3496C  mov         rax,qword ptr [rsi+8]  
000007FE99F34970  mov         edx,dword ptr [rax+8]  
000007FE99F34973  test        edx,edx  
000007FE99F34975  jle         000007FE99F34993  
            {
                heapTotal += vals[i];
000007FE99F34977  mov         r8,rax  
000007FE99F3497A  cmp         ecx,edx  
000007FE99F3497C  jae         000007FE99F349C8  
000007FE99F3497E  movsxd      r9,ecx  
000007FE99F34981  mov         r8d,dword ptr [r8+r9*4+10h]  
000007FE99F34986  movsxd      r8,r8d  
000007FE99F34989  add         qword ptr [rsi+10h],r8  

You can see it's using the memory operand [rsi+10h] for the heapTotal field, so every addition is a read-modify-write to memory.

And this is from the stackTotaler disassembly:

            long stackTotal = 0;
000007FE99F3427A  xor         ecx,ecx  
            for (int i = 0; i < vals.Length; ++i)
000007FE99F3427C  xor         eax,eax  
000007FE99F3427E  mov         rdx,qword ptr [rsi+8]  
000007FE99F34282  mov         r8d,dword ptr [rdx+8]  
000007FE99F34286  test        r8d,r8d  
000007FE99F34289  jle         000007FE99F342A8  
            {
                stackTotal += vals[i];
000007FE99F3428B  mov         r9,rdx  
000007FE99F3428E  cmp         eax,r8d  
000007FE99F34291  jae         000007FE99F342DD  
000007FE99F34293  movsxd      r10,eax  
000007FE99F34296  mov         r9d,dword ptr [r9+r10*4+10h]  
000007FE99F3429B  movsxd      r9,r9d  
000007FE99F3429E  add         rcx,r9  

You can see the JIT has optimized the code: it keeps stackTotal in the RCX register.

Registers are faster than memory access, hence the speed improvement.
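
One way to confirm that the per-iteration field store is the cost (this variant is mine, not part of the original post) is to accumulate into a local and write the field back once after the loop; the jitter can then keep the total in a register, just as it does in stackTotaler:

public long heapTotalerWithLocal()
{
    setup();
    var stopWatch = Stopwatch.StartNew();

    long total = heapTotal;              // read the field once
    for (int i = 0; i < vals.Length; ++i)
    {
        total += vals[i];                // can stay in a register
    }
    heapTotal = total;                   // single write back to the field

    stopWatch.Stop();
    return stopWatch.ElapsedMilliseconds;
}

With this change the timing should land close to stackTotaler's.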
