Can a shift using the CL register result in a partial register stall?
Can a variable shift generate a partial register stall (or register recombining µops) on ecx? If so, on which microarchitecture(s)?
I have tested this on Core2 (65nm), which seems to read only cl.
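_shiftbench:
    push rbx                ; rbx is callee-saved and bl is written below
    mov edx, -10000000      ; loop counter, counts up to zero
    mov ecx, 5              ; shift count in cl, full ecx written once
_shiftloop:
    mov bl, 5               ; replace by cl to see possible recombining
    shl eax, cl             ; variable shift under test
    add edx, 1
    jnz _shiftloop
    pop rbx
    ret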
Replacing mov bl, 5 by mov cl, 5 made no difference, which it would have if there were register recombining going on. That can be demonstrated by replacing shl eax, cl by add eax, ecx (in my tests the version with add experienced a 2.8x slowdown when writing to cl instead of bl).
Test results:
Update: the new shrx-group of shifts in Haswell does show that stall. The shift-count argument is not written as an 8-bit register, so that might have been expected, but the textual representation really doesn't say anything about such micro-architectural details.
As currently phrased (“Can a shift using the CL register …”) the question's title contains its own answer: with a modern processor, there is never a partial register stall on CL because CL can never be recombined from something smaller.
Yes, the processor knows that the amount you are shifting by is effectively contained in CL, the 5 or 6 least significant bits of CL to be precise. One way it could have stalled on ECX is if the granularity at which it tracked instruction dependencies did not go below full registers. This worry is obsolete, though: the newest Intel processor that would have considered the whole ECX register as a dependency was the Pentium 4. See Agner Fog's unofficial optimization manual, page 121. But then again, with the P4 this would not be called a partial register stall; the program could only be the victim of a false dependency (say, if CH was modified just before the shift).