Can a shift using the CL register result in a partial register stall?

Can a variable shift generate a partial register stall (or register recombining µops) on ecx? If so, on which microarchitecture(s)?

I have tested this on Core2 (65nm), which seems to read only cl.

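_shiftbench:
    push rbx              ; rbx is callee-saved; bl is written below
    mov edx, -10000000    ; loop counter: counts up to zero
    mov ecx, 5            ; shift count, written as a full register
  _shiftloop:
    mov bl, 5             ; replace bl with cl to test for recombining
    shl eax, cl           ; variable shift under test
    add edx, 1
    jnz _shiftloop
    pop rbx
    ret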

Replacing mov bl, 5 with mov cl, 5 made no difference. It would have made one if register recombining were taking place, as can be demonstrated by replacing shl eax, cl with add eax, ecx: in my tests, the add version ran 2.8x slower when writing to cl instead of bl.


Test results:

  • Merom: no stall observed
  • Penryn: no stall observed
  • Nehalem: no stall observed
  • Update: the new shrx group of shifts in Haswell does show that stall. Their shift-count operand is not written as an 8-bit register, so that might have been expected, but the textual representation of an instruction really says nothing about such microarchitectural details.


    As currently phrased (“Can a shift using the CL register …”) the question's title contains its own answer: with a modern processor, there is never a partial register stall on CL because CL can never be recombined from something smaller.

    Yes, the processor knows that the amount you are shifting by is effectively contained in CL, or, to be precise, in the 5 or 6 least significant bits of CL. One way it could have stalled on ECX is if the granularity at which it tracked instruction dependencies did not go below full registers. This worry is obsolete, though: the newest Intel processor that would have considered the whole ECX register as a dependency was the Pentium 4. See Agner Fog's unofficial optimization manual, page 121. Then again, on the P4 this would not be called a partial register stall; the program could only fall victim to a false dependency (say, if CH was modified just before the shift).
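The "5 or 6 least significant bits" remark can be illustrated with a small C model of the shift. This is only a sketch of the architectural behaviour, not a benchmark: the function names shl32_by_cl and shl64_by_cl are made up for this example, and the masking is written out explicitly because shifting by the operand width or more is undefined in C itself.

```c
#include <stdint.h>

/* Hypothetical model of `shl r32, cl`: for 32-bit operands the
   count is taken from the low 5 bits of CL (count & 31). */
static uint32_t shl32_by_cl(uint32_t value, uint8_t cl) {
    return value << (cl & 31u);
}

/* Hypothetical model of `shl r64, cl`: for 64-bit operands the
   count is taken from the low 6 bits of CL (count & 63). */
static uint64_t shl64_by_cl(uint64_t value, uint8_t cl) {
    return value << (cl & 63u);
}
```

With cl = 33, shl32_by_cl(1, 33) shifts by 1 and yields 2, while shl64_by_cl(1, 33) shifts by the full 33 bits. Either way the upper bits of ECX never enter the result, which is consistent with the shift depending only on CL.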
