catch speeding up my code?
I wrote some code for testing the impact of try-catch, but seeing some surprising results.
static void Main(string[] args)
{
Thread.CurrentThread.Priority = ThreadPriority.Highest;
Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.RealTime;
long start = 0, stop = 0, elapsed = 0;
double avg = 0.0;
long temp = Fibo(1);
for (int i = 1; i < 100000000; i++)
{
start = Stopwatch.GetTimestamp();
temp = Fibo(100);
stop = Stopwatch.GetTimestamp();
elapsed = stop - start;
avg = avg + ((double)elapsed - avg) / i;
}
Console.WriteLine("Elapsed: " + avg);
Console.ReadKey();
}
static long Fibo(int n)
{
long n1 = 0, n2 = 1, fibo = 0;
n++;
for (int i = 1; i < n; i++)
{
n1 = n2;
n2 = fibo;
fibo = n1 + n2;
}
return fibo;
}
On my computer, this consistently prints out a value around 0.96..
When I wrap the for loop inside Fibo() with a try-catch block like this:
static long Fibo(int n)
{
long n1 = 0, n2 = 1, fibo = 0;
n++;
try
{
for (int i = 1; i < n; i++)
{
n1 = n2;
n2 = fibo;
fibo = n1 + n2;
}
}
catch {}
return fibo;
}
Now it consistently prints out 0.69... -- it actually runs faster! But why?
Note: I compiled this using the Release configuration and directly ran the EXE file (outside Visual Studio).
EDIT: Jon Skeet's excellent analysis shows that try-catch is somehow causing the x86 CLR to use the CPU registers in a more favorable way in this specific case (and I think we're yet to understand why). I confirmed Jon's finding that x64 CLR doesn't have this difference, and that it was faster than the x86 CLR. I also tested using int
types inside the Fibo method instead of long
types, and then the x86 CLR was as equally fast as the x64 CLR.
UPDATE: It looks like this issue has been fixed by Roslyn. Same machine, same CLR version -- the issue remains as above when compiled with VS 2013, but the problem goes away when compiled with VS 2015.
One of the Roslyn engineers who specializes in understanding optimization of stack usage took a look at this and reports to me that there seems to be a problem in the interaction between the way the C# compiler generates local variable stores and the way the JIT compiler does register scheduling in the corresponding x86 code. The result is suboptimal code generation on the loads and stores of the locals.
For some reason unclear to all of us, the problematic code generation path is avoided when the JITter knows that the block is in a try-protected region.
This is pretty weird. We'll follow up with the JITter team and see whether we can get a bug entered so that they can fix this.
Also, we are working on improvements for Roslyn to the C# and VB compilers' algorithms for determining when locals can be made "ephemeral" -- that is, just pushed and popped on the stack, rather than allocated a specific location on the stack for the duration of the activation. We believe that the JITter will be able to do a better job of register allocation and whatnot if we give it better hints about when locals can be made "dead" earlier.
Thanks for bringing this to our attention, and apologies for the odd behaviour.
Well, the way you're timing things looks pretty nasty to me. It would be much more sensible to just time the whole loop:
var stopwatch = Stopwatch.StartNew();
for (int i = 1; i < 100000000; i++)
{
Fibo(100);
}
stopwatch.Stop();
Console.WriteLine("Elapsed time: {0}", stopwatch.Elapsed);
That way you're not at the mercy of tiny timings, floating point arithmetic and accumulated error.
Having made that change, see whether the "non-catch" version is still slower than the "catch" version.
EDIT: Okay, I've tried it myself - and I'm seeing the same result. Very odd. I wondered whether the try/catch was disabling some bad inlining, but using [MethodImpl(MethodImplOptions.NoInlining)]
instead didn't help...
Basically you'll need to look at the optimized JITted code under cordbg, I suspect...
EDIT: A few more bits of information:
n++;
line still improves performance, but not by as much as putting it around the whole block ArgumentException
in my tests) it's still fast Weird...
EDIT: Okay, we have disassembly...
This is using the C# 2 compiler and .NET 2 (32-bit) CLR, disassembling with mdbg (as I don't have cordbg on my machine). I still see the same performance effects, even under the debugger. The fast version uses a try
block around everything between the variable declarations and the return statement, with just a catch{}
handler. Obviously the slow version is the same except without the try/catch. The calling code (ie Main) is the same in both cases, and has the same assembly representation (so it's not an inlining issue).
Disassembled code for fast version:
[0000] push ebp
[0001] mov ebp,esp
[0003] push edi
[0004] push esi
[0005] push ebx
[0006] sub esp,1Ch
[0009] xor eax,eax
[000b] mov dword ptr [ebp-20h],eax
[000e] mov dword ptr [ebp-1Ch],eax
[0011] mov dword ptr [ebp-18h],eax
[0014] mov dword ptr [ebp-14h],eax
[0017] xor eax,eax
[0019] mov dword ptr [ebp-18h],eax
*[001c] mov esi,1
[0021] xor edi,edi
[0023] mov dword ptr [ebp-28h],1
[002a] mov dword ptr [ebp-24h],0
[0031] inc ecx
[0032] mov ebx,2
[0037] cmp ecx,2
[003a] jle 00000024
[003c] mov eax,esi
[003e] mov edx,edi
[0040] mov esi,dword ptr [ebp-28h]
[0043] mov edi,dword ptr [ebp-24h]
[0046] add eax,dword ptr [ebp-28h]
[0049] adc edx,dword ptr [ebp-24h]
[004c] mov dword ptr [ebp-28h],eax
[004f] mov dword ptr [ebp-24h],edx
[0052] inc ebx
[0053] cmp ebx,ecx
[0055] jl FFFFFFE7
[0057] jmp 00000007
[0059] call 64571ACB
[005e] mov eax,dword ptr [ebp-28h]
[0061] mov edx,dword ptr [ebp-24h]
[0064] lea esp,[ebp-0Ch]
[0067] pop ebx
[0068] pop esi
[0069] pop edi
[006a] pop ebp
[006b] ret
Disassembled code for slow version:
[0000] push ebp
[0001] mov ebp,esp
[0003] push esi
[0004] sub esp,18h
*[0007] mov dword ptr [ebp-14h],1
[000e] mov dword ptr [ebp-10h],0
[0015] mov dword ptr [ebp-1Ch],1
[001c] mov dword ptr [ebp-18h],0
[0023] inc ecx
[0024] mov esi,2
[0029] cmp ecx,2
[002c] jle 00000031
[002e] mov eax,dword ptr [ebp-14h]
[0031] mov edx,dword ptr [ebp-10h]
[0034] mov dword ptr [ebp-0Ch],eax
[0037] mov dword ptr [ebp-8],edx
[003a] mov eax,dword ptr [ebp-1Ch]
[003d] mov edx,dword ptr [ebp-18h]
[0040] mov dword ptr [ebp-14h],eax
[0043] mov dword ptr [ebp-10h],edx
[0046] mov eax,dword ptr [ebp-0Ch]
[0049] mov edx,dword ptr [ebp-8]
[004c] add eax,dword ptr [ebp-1Ch]
[004f] adc edx,dword ptr [ebp-18h]
[0052] mov dword ptr [ebp-1Ch],eax
[0055] mov dword ptr [ebp-18h],edx
[0058] inc esi
[0059] cmp esi,ecx
[005b] jl FFFFFFD3
[005d] mov eax,dword ptr [ebp-1Ch]
[0060] mov edx,dword ptr [ebp-18h]
[0063] lea esp,[ebp-4]
[0066] pop esi
[0067] pop ebp
[0068] ret
In each case the *
shows where the debugger entered in a simple "step-into".
EDIT: Okay, I've now looked through the code and I think I can see how each version works... and I believe the slower version is slower because it uses fewer registers and more stack space. For small values of n
that's possibly faster - but when the loop takes up the bulk of the time, it's slower.
Possibly the try/catch block forces more registers to be saved and restored, so the JIT uses those for the loop as well... which happens to improve the performance overall. It's not clear whether it's a reasonable decision for the JIT to not use as many registers in the "normal" code.
EDIT: Just tried this on my x64 machine. The x64 CLR is much faster (about 3-4 times faster) than the x86 CLR on this code, and under x64 the try/catch block doesn't make a noticeable difference.
Jon's disassemblies show, that the difference between the two versions is that the fast version uses a pair of registers ( esi,edi
) to store one of the local variables where the slow version doesn't.
The JIT compiler makes different assumptions regarding register use for code that contains a try-catch block vs. code which doesn't. This causes it to make different register allocation choices. In this case, this favors the code with the try-catch block. Different code may lead to the opposite effect, so I would not count this as a general-purpose speed-up technique.
In the end, it's very hard to tell which code will end up running the fastest. Something like register allocation and the factors that influence it are such low-level implementation details that I don't see how any specific technique could reliably produce faster code.
For example, consider the following two methods. They were adapted from a real-life example:
interface IIndexed { int this[int index] { get; set; } }
struct StructArray : IIndexed {
public int[] Array;
public int this[int index] {
get { return Array[index]; }
set { Array[index] = value; }
}
}
static int Generic<T>(int length, T a, T b) where T : IIndexed {
int sum = 0;
for (int i = 0; i < length; i++)
sum += a[i] * b[i];
return sum;
}
static int Specialized(int length, StructArray a, StructArray b) {
int sum = 0;
for (int i = 0; i < length; i++)
sum += a[i] * b[i];
return sum;
}
One is a generic version of the other. Replacing the generic type with StructArray
would make the methods identical. Because StructArray
is a value type, it gets its own compiled version of the generic method. Yet the actual running time is significantly longer than the specialized method's, but only for x86. For x64, the timings are pretty much identical. In other cases, I've observed differences for x64 as well.
下一篇: 赶上加速我的代码?