catch speeding up my code?

2018-05-30 05:11:42

I wrote some code for testing the impact of try-catch, but seeing some surprising results.

static void Main(string[] args)
{
    Thread.CurrentThread.Priority = ThreadPriority.Highest;
    Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.RealTime;

    long start = 0, stop = 0, elapsed = 0;
    double avg = 0.0;

    long temp = Fibo(1);

    for (int i = 1; i < 100000000; i++)
    {
        start = Stopwatch.GetTimestamp();
        temp = Fibo(100);
        stop = Stopwatch.GetTimestamp();

        elapsed = stop - start;
        avg = avg + ((double)elapsed - avg) / i;
    }

    Console.WriteLine("Elapsed: " + avg);
    Console.ReadKey();
}

static long Fibo(int n)
{
    long n1 = 0, n2 = 1, fibo = 0;
    n++;

    for (int i = 1; i < n; i++)
    {
        n1 = n2;
        n2 = fibo;
        fibo = n1 + n2;
    }

    return fibo;
}

On my computer, this consistently prints out a value around 0.96..

When I wrap the for loop inside Fibo() with a try-catch block like this:

static long Fibo(int n)
{
    long n1 = 0, n2 = 1, fibo = 0;
    n++;

    try
    {
        for (int i = 1; i < n; i++)
        {
            n1 = n2;
            n2 = fibo;
            fibo = n1 + n2;
        }
    }
    catch {}

    return fibo;
}

Now it consistently prints out 0.69... -- it actually runs faster! But why?

Note: I compiled this using the Release configuration and directly ran the EXE file (outside Visual Studio).

EDIT: Jon Skeet's excellent analysis shows that try-catch is somehow causing the x86 CLR to use the CPU registers in a more favorable way in this specific case (and I think we're yet to understand why). I confirmed Jon's finding that x64 CLR doesn't have this difference, and that it was faster than the x86 CLR. I also tested using int types inside the Fibo method instead of long types, and then the x86 CLR was as equally fast as the x64 CLR.

UPDATE: It looks like this issue has been fixed by Roslyn. Same machine, same CLR version -- the issue remains as above when compiled with VS 2013, but the problem goes away when compiled with VS 2015.

One of the Roslyn engineers who specializes in understanding optimization of stack usage took a look at this and reports to me that there seems to be a problem in the interaction between the way the C# compiler generates local variable stores and the way the JIT compiler does register scheduling in the corresponding x86 code. The result is suboptimal code generation on the loads and stores of the locals.

For some reason unclear to all of us, the problematic code generation path is avoided when the JITter knows that the block is in a try-protected region.

This is pretty weird. We'll follow up with the JITter team and see whether we can get a bug entered so that they can fix this.

Also, we are working on improvements for Roslyn to the C# and VB compilers' algorithms for determining when locals can be made "ephemeral" -- that is, just pushed and popped on the stack, rather than allocated a specific location on the stack for the duration of the activation. We believe that the JITter will be able to do a better job of register allocation and whatnot if we give it better hints about when locals can be made "dead" earlier.

Thanks for bringing this to our attention, and apologies for the odd behaviour.

Well, the way you're timing things looks pretty nasty to me. It would be much more sensible to just time the whole loop:

var stopwatch = Stopwatch.StartNew();
for (int i = 1; i < 100000000; i++)
{
    Fibo(100);
}
stopwatch.Stop();
Console.WriteLine("Elapsed time: {0}", stopwatch.Elapsed);

That way you're not at the mercy of tiny timings, floating point arithmetic and accumulated error.

Having made that change, see whether the "non-catch" version is still slower than the "catch" version.

EDIT: Okay, I've tried it myself - and I'm seeing the same result. Very odd. I wondered whether the try/catch was disabling some bad inlining, but using [MethodImpl(MethodImplOptions.NoInlining)] instead didn't help...

Basically you'll need to look at the optimized JITted code under cordbg, I suspect...

EDIT: A few more bits of information:

Putting the try/catch around just the n++; line still improves performance, but not by as much as putting it around the whole block

If you catch a specific exception ( ArgumentException in my tests) it's still fast

If you print the exception in the catch block it's still fast

If you rethrow the exception in the catch block it's slow again

If you use a finally block instead of a catch block it's slow again

If you use a finally block as well as a catch block, it's fast

Weird...

EDIT: Okay, we have disassembly...

This is using the C# 2 compiler and .NET 2 (32-bit) CLR, disassembling with mdbg (as I don't have cordbg on my machine). I still see the same performance effects, even under the debugger. The fast version uses a try block around everything between the variable declarations and the return statement, with just a catch{} handler. Obviously the slow version is the same except without the try/catch. The calling code (ie Main) is the same in both cases, and has the same assembly representation (so it's not an inlining issue).

Disassembled code for fast version:

 [0000] push        ebp
 [0001] mov         ebp,esp
 [0003] push        edi
 [0004] push        esi
 [0005] push        ebx
 [0006] sub         esp,1Ch
 [0009] xor         eax,eax
 [000b] mov         dword ptr [ebp-20h],eax
 [000e] mov         dword ptr [ebp-1Ch],eax
 [0011] mov         dword ptr [ebp-18h],eax
 [0014] mov         dword ptr [ebp-14h],eax
 [0017] xor         eax,eax
 [0019] mov         dword ptr [ebp-18h],eax
*[001c] mov         esi,1
 [0021] xor         edi,edi
 [0023] mov         dword ptr [ebp-28h],1
 [002a] mov         dword ptr [ebp-24h],0
 [0031] inc         ecx
 [0032] mov         ebx,2
 [0037] cmp         ecx,2
 [003a] jle         00000024
 [003c] mov         eax,esi
 [003e] mov         edx,edi
 [0040] mov         esi,dword ptr [ebp-28h]
 [0043] mov         edi,dword ptr [ebp-24h]
 [0046] add         eax,dword ptr [ebp-28h]
 [0049] adc         edx,dword ptr [ebp-24h]
 [004c] mov         dword ptr [ebp-28h],eax
 [004f] mov         dword ptr [ebp-24h],edx
 [0052] inc         ebx
 [0053] cmp         ebx,ecx
 [0055] jl          FFFFFFE7
 [0057] jmp         00000007
 [0059] call        64571ACB
 [005e] mov         eax,dword ptr [ebp-28h]
 [0061] mov         edx,dword ptr [ebp-24h]
 [0064] lea         esp,[ebp-0Ch]
 [0067] pop         ebx
 [0068] pop         esi
 [0069] pop         edi
 [006a] pop         ebp
 [006b] ret

Disassembled code for slow version:

 [0000] push        ebp
 [0001] mov         ebp,esp
 [0003] push        esi
 [0004] sub         esp,18h
*[0007] mov         dword ptr [ebp-14h],1
 [000e] mov         dword ptr [ebp-10h],0
 [0015] mov         dword ptr [ebp-1Ch],1
 [001c] mov         dword ptr [ebp-18h],0
 [0023] inc         ecx
 [0024] mov         esi,2
 [0029] cmp         ecx,2
 [002c] jle         00000031
 [002e] mov         eax,dword ptr [ebp-14h]
 [0031] mov         edx,dword ptr [ebp-10h]
 [0034] mov         dword ptr [ebp-0Ch],eax
 [0037] mov         dword ptr [ebp-8],edx
 [003a] mov         eax,dword ptr [ebp-1Ch]
 [003d] mov         edx,dword ptr [ebp-18h]
 [0040] mov         dword ptr [ebp-14h],eax
 [0043] mov         dword ptr [ebp-10h],edx
 [0046] mov         eax,dword ptr [ebp-0Ch]
 [0049] mov         edx,dword ptr [ebp-8]
 [004c] add         eax,dword ptr [ebp-1Ch]
 [004f] adc         edx,dword ptr [ebp-18h]
 [0052] mov         dword ptr [ebp-1Ch],eax
 [0055] mov         dword ptr [ebp-18h],edx
 [0058] inc         esi
 [0059] cmp         esi,ecx
 [005b] jl          FFFFFFD3
 [005d] mov         eax,dword ptr [ebp-1Ch]
 [0060] mov         edx,dword ptr [ebp-18h]
 [0063] lea         esp,[ebp-4]
 [0066] pop         esi
 [0067] pop         ebp
 [0068] ret

In each case the * shows where the debugger entered in a simple "step-into".

EDIT: Okay, I've now looked through the code and I think I can see how each version works... and I believe the slower version is slower because it uses fewer registers and more stack space. For small values of n that's possibly faster - but when the loop takes up the bulk of the time, it's slower.

Possibly the try/catch block forces more registers to be saved and restored, so the JIT uses those for the loop as well... which happens to improve the performance overall. It's not clear whether it's a reasonable decision for the JIT to not use as many registers in the "normal" code.

EDIT: Just tried this on my x64 machine. The x64 CLR is much faster (about 3-4 times faster) than the x86 CLR on this code, and under x64 the try/catch block doesn't make a noticeable difference.

Jon's disassemblies show, that the difference between the two versions is that the fast version uses a pair of registers ( esi,edi ) to store one of the local variables where the slow version doesn't.

The JIT compiler makes different assumptions regarding register use for code that contains a try-catch block vs. code which doesn't. This causes it to make different register allocation choices. In this case, this favors the code with the try-catch block. Different code may lead to the opposite effect, so I would not count this as a general-purpose speed-up technique.

In the end, it's very hard to tell which code will end up running the fastest. Something like register allocation and the factors that influence it are such low-level implementation details that I don't see how any specific technique could reliably produce faster code.

For example, consider the following two methods. They were adapted from a real-life example:

interface IIndexed { int this[int index] { get; set; } }
struct StructArray : IIndexed { 
    public int[] Array;
    public int this[int index] {
        get { return Array[index]; }
        set { Array[index] = value; }
    }
}

static int Generic<T>(int length, T a, T b) where T : IIndexed {
    int sum = 0;
    for (int i = 0; i < length; i++)
        sum += a[i] * b[i];
    return sum;
}
static int Specialized(int length, StructArray a, StructArray b) {
    int sum = 0;
    for (int i = 0; i < length; i++)
        sum += a[i] * b[i];
    return sum;
}

One is a generic version of the other. Replacing the generic type with StructArray would make the methods identical. Because StructArray is a value type, it gets its own compiled version of the generic method. Yet the actual running time is significantly longer than the specialized method's, but only for x86. For x64, the timings are pretty much identical. In other cases, I've observed differences for x64 as well.

链接地址: http://www.djcxy.com/p/3582.html

上一篇: arr）[0]编译和等于sizeof（我的

下一篇: 赶上加速我的代码？