Why are mov ah, bh and mov al, bl together much faster than the single instruction mov ax, bx?
I've found that
mov al, bl
mov ah, bh
is much faster than
mov ax, bx
Can anyone explain why? I'm running on a Core 2 Duo at 3 GHz, in 32-bit mode under Windows XP, compiling with NASM and then linking with VS2010. NASM compile command:
nasm -f coff -o triangle.o triangle.asm
Here is the main loop I'm using to render a triangle:
; some variables on stack
%define cr DWORD [ebp-20]
%define dcr DWORD [ebp-24]
%define dcg DWORD [ebp-32]
%define dcb DWORD [ebp-40]
loop:
add esi, dcg
mov eax, esi
shr eax, 8
add edi, dcb
mov ebx, edi
shr ebx, 16
mov bh, ah
mov eax, cr
add eax, dcr
mov cr, eax
mov ah, bh ; faster
mov al, bl
;mov ax, bx
mov DWORD [edx], eax
add edx, 4
dec ecx
jge loop
I can provide whole VS project with sources for testing.
Why is it slow
The reason using a 16-bit register is expensive, as opposed to an 8-bit register, is that 16-bit register instructions are decoded in microcode. This means an extra cycle during decoding, and the instructions cannot be paired while decoding.
Also, because ax is a partial register, it takes an extra cycle to execute, because the top part of the register needs to be combined with the write to the lower part.
8-bit writes have special hardware in place to speed this up, but 16-bit writes do not. Again, on many processors the 16-bit instructions take 2 cycles instead of one, and they do not allow pairing.
This means that instead of being able to process 12 instructions (3 per cycle) in 4 cycles, you can now only execute 1, because you have a stall when decoding the instruction into microcode and a stall when processing the microcode.
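As a minimal illustration of that combine (my own example, not code from the question):
mov ax, bx      ; writes only the low 16 bits; the old top half of eax has to be preserved
mov [edx], eax  ; this read of the full eax has to wait for that combine to finish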
How can I make it faster?
mov al, bl
mov ah, bh
(This code takes a minimum of 2 CPU cycles and may give a stall on the second instruction, because on some (older) x86 CPUs you get a lock on EAX.)
Here's what happens: each 8-bit write has to read EAX, change part of it, and write the full register back, so the second write can be made to wait until the first one is fully resolved.
On the latest Core2 CPUs this is not so much of a problem, because extra hardware has been put in place that knows that bl and bh really never get in each other's way.
Use:
mov eax, ebx
which moves 4 bytes at a time; that single instruction will run in 1 CPU cycle (and can be paired with other instructions in parallel).
Speeding up the code
I see a few opportunities to speed up the code.
; some variables on stack
%define cr DWORD [ebp-20]
%define dcr DWORD [ebp-24]
%define dcg DWORD [ebp-32]
%define dcb DWORD [ebp-40]
mov edx,cr
loop:
add esi, dcg
mov eax, esi
shr eax, 8
add edi, dcb
mov ebx, edi
shr ebx, 16 ;higher 16 bits in ebx will be empty.
mov bh, ah
;mov eax, cr
;add eax, dcr
;mov cr, eax
add edx,dcr
mov eax,edx
and eax,0xFFFF0000 ; clear lower 16 bits in EAX
or eax,ebx ; merge the two.
;mov ah, bh ; faster
;mov al, bl
mov DWORD [ebp+offset+ecx*4], eax ; requires storing the data in reverse order.
;add edx, 4
sub ecx,1 ;dec ecx does not change the carry flag, which can cause
;a false dependency on previous instructions which do change CF
jge loop
It is also faster on my Core 2 Duo CPU L9300 1.60GHz. As I wrote in a comment, I think this is related to the use of partial registers (ah, al, ax). See more e.g. here, here and here (pg. 88).
I've written a small test suite to try and improve on the code, and while not using the ax version presented in the OP is the smartest fix, trying to eliminate partial register usage entirely does improve the speed (even more so than my quick attempt at freeing up another register).
To get more information on why one version is faster than another, I think more careful reading of the source material is required, and/or the use of something like Intel VTune or AMD CodeAnalyst. (It could turn out that I'm wrong.)
UPDATE: while the oprofile output below doesn't prove anything, it does show that there are a lot of partial-register stalls occurring in both versions, but roughly twice as many in the slowest version (triAsm2) as in the 'fast' version (triAsm1).
$ opreport -l test
CPU: Core 2, speed 1600 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 800500
Counted RAT_STALLS events (Partial register stall cycles) with a unit mask of 0x0f (All RAT) count 1000000
samples % samples % symbol name
21039 27.3767 10627 52.3885 triAsm2.loop
16125 20.9824 4815 23.7368 triC
14439 18.7885 4828 23.8008 triAsm1.loop
12557 16.3396 0 0 triAsm3.loop
12161 15.8243 8 0.0394 triAsm4.loop
Complete oprofile output.
Results:
triC: 7410.000000 ms, a5afb9 (C implementation of the asm code)
triAsm1: 6690.000000 ms, a5afb9 (Code from OP, using al and ah)
triAsm2: 9290.000000 ms, a5afb9 (Code from OP, using ax)
triAsm3: 5760.000000 ms, a5afb9 (Straight forward translation of OPs code to one without partial register usage)
triAsm4: 5640.000000 ms, a5afb9 (Quick attempt at making it faster)
Here is my test suite, compiled with -std=c99 -ggdb -m32 -O3 -march=native -mtune=native:
test.c:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
extern void triC(uint32_t* dest, uint32_t cnt, uint32_t cr, uint32_t cg, uint32_t cb, uint32_t dcr, uint32_t dcg, uint32_t dcb);
extern void triAsm1(uint32_t* dest, uint32_t cnt, uint32_t cr, uint32_t cg, uint32_t cb, uint32_t dcr, uint32_t dcg, uint32_t dcb);
extern void triAsm2(uint32_t* dest, uint32_t cnt, uint32_t cr, uint32_t cg, uint32_t cb, uint32_t dcr, uint32_t dcg, uint32_t dcb);
extern void triAsm3(uint32_t* dest, uint32_t cnt, uint32_t cr, uint32_t cg, uint32_t cb, uint32_t dcr, uint32_t dcg, uint32_t dcb);
extern void triAsm4(uint32_t* dest, uint32_t cnt, uint32_t cr, uint32_t cg, uint32_t cb, uint32_t dcr, uint32_t dcg, uint32_t dcb);
uint32_t scanline[640];
#define test(tri) \
{ \
    clock_t start = clock(); \
    srand(60); \
    for (int i = 0; i < 5000000; i++) { \
        tri(scanline, rand() % 640, 10<<16, 20<<16, 30<<16, 1<<14, 1<<14, 1<<14); \
    } \
    printf(#tri ": %f ms, %x\n", (clock()-start)*1000.0/CLOCKS_PER_SEC, scanline[620]); \
}
int main() {
test(triC);
test(triAsm1);
test(triAsm2);
test(triAsm3);
test(triAsm4);
return 0;
}
tri.c:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
void triC(uint32_t* dest, uint32_t cnt, uint32_t cr, uint32_t cg, uint32_t cb, uint32_t dcr, uint32_t dcg, uint32_t dcb) {
while (cnt--) {
cr += dcr;
cg += dcg;
cb += dcb;
*dest++ = (cr & 0xffff0000) | ((cg >> 8) & 0xff00) | ((cb >> 16) & 0xff);
}
}
atri.asm:
bits 32
section .text
global triAsm1
global triAsm2
global triAsm3
global triAsm4
%define cr DWORD [ebp+0x10]
%define dcr DWORD [ebp+0x1c]
%define dcg DWORD [ebp+0x20]
%define dcb DWORD [ebp+0x24]
triAsm1:
push ebp
mov ebp, esp
pusha
mov edx, [ebp+0x08] ; dest
mov ecx, [ebp+0x0c] ; cnt
mov esi, [ebp+0x14] ; cg
mov edi, [ebp+0x18] ; cb
.loop:
add esi, dcg
mov eax, esi
shr eax, 8
add edi, dcb
mov ebx, edi
shr ebx, 16
mov bh, ah
mov eax, cr
add eax, dcr
mov cr, eax
mov ah, bh ; faster
mov al, bl
mov DWORD [edx], eax
add edx, 4
dec ecx
jge .loop
popa
pop ebp
ret
triAsm2:
push ebp
mov ebp, esp
pusha
mov edx, [ebp+0x08] ; dest
mov ecx, [ebp+0x0c] ; cnt
mov esi, [ebp+0x14] ; cg
mov edi, [ebp+0x18] ; cb
.loop:
add esi, dcg
mov eax, esi
shr eax, 8
add edi, dcb
mov ebx, edi
shr ebx, 16
mov bh, ah
mov eax, cr
add eax, dcr
mov cr, eax
mov ax, bx ; slower
mov DWORD [edx], eax
add edx, 4
dec ecx
jge .loop
popa
pop ebp
ret
triAsm3:
push ebp
mov ebp, esp
pusha
mov edx, [ebp+0x08] ; dest
mov ecx, [ebp+0x0c] ; cnt
mov esi, [ebp+0x14] ; cg
mov edi, [ebp+0x18] ; cb
.loop:
mov eax, cr
add eax, dcr
mov cr, eax
and eax, 0xffff0000
add esi, dcg
mov ebx, esi
shr ebx, 8
and ebx, 0x0000ff00
or eax, ebx
add edi, dcb
mov ebx, edi
shr ebx, 16
and ebx, 0x000000ff
or eax, ebx
mov DWORD [edx], eax
add edx, 4
dec ecx
jge .loop
popa
pop ebp
ret
triAsm4:
push ebp
mov ebp, esp
pusha
mov [stackptr], esp
mov edi, [ebp+0x08] ; dest
mov ecx, [ebp+0x0c] ; cnt
mov edx, [ebp+0x10] ; cr
mov esi, [ebp+0x14] ; cg
mov esp, [ebp+0x18] ; cb
.loop:
add edx, dcr
add esi, dcg
add esp, dcb
;*dest++ = (cr & 0xffff0000) | ((cg >> 8) & 0xff00) | ((cb >> 16) & 0xff);
mov eax, edx ; eax=cr
and eax, 0xffff0000
mov ebx, esi ; ebx=cg
shr ebx, 8
and ebx, 0xff00
or eax, ebx
;mov ah, bh
mov ebx, esp
shr ebx, 16
and ebx, 0xff
or eax, ebx
;mov al, bl
mov DWORD [edi], eax
add edi, 4
dec ecx
jge .loop
mov esp, [stackptr]
popa
pop ebp
ret
section .data
stackptr: dd 0
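(The exact build commands aren't shown in this answer; on 32-bit Linux something like the following should work. The ELF object format is an assumption, guessed from the asm labels having no leading underscore, and the gcc flags are the ones listed above.)
nasm -f elf32 -o atri.o atri.asm
gcc -std=c99 -ggdb -m32 -O3 -march=native -mtune=native -o test test.c tri.c atri.o
./test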
Summary: 16-bit instructions are not the problem directly. The problem is reading wider registers after writing partial registers, causing a partial-register stall on Core2. This is much less of a problem on Sandybridge and later, since they merge much more cheaply. mov ax, bx causes an extra merge, but even the OP's "fast" version has some stalls.
See the end of this answer for an alternate scalar inner-loop which should be faster than the other two answers, using shld to shuffle bytes between registers. Pre-shifting things left by 8b outside the loop puts the byte we want at the top of each register, which makes this really cheap. It should run at slightly better than one iteration per 4 clock cycles on 32bit core2, and saturate all three execution ports with no stalls. It should run at one iteration per 2.5c on Haswell.
To actually do this fast though, look at auto-vectorized compiler output, and maybe pare that down or re-implement with vector intrinsics.
Contrary to the claims of 16bit operand-size instructions being slow, Core2 can in theory sustain 3 insns per clock alternating mov ax, bx and mov ecx, edx. There is no "mode switch" of any kind. (As everyone has pointed out, "context switch" is a terrible choice of made up name, because it already has a specific technical meaning.)
The problem is partial register stalls when you read a reg that you previously wrote only a part of. Instead of forcing a write to ax to wait on the old contents of eax being ready (false dependency), Intel P6-family CPUs track dependencies for partial regs separately. Reading the wider reg forces a merge, which stalls for 2 to 3 cycles according to Agner Fog. The other big problem with using 16bit operand size is with immediate operands, where you can get LCP stalls in the decoders on Intel CPUs for immediates that don't fit in an imm8.
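A minimal illustration of both penalties (my own example, not from the question's code):
mov ax, bx       ; writes only ax; the upper half of eax keeps its separately-renamed old value
mov [edx], eax   ; reading the full eax forces the 2-3 cycle merge on Core2 / P6-family
add ax, 0x1234   ; operand-size prefix plus an imm16: LCP stall in the decoders
add ax, 0x12     ; the immediate fits in a sign-extended imm8, so no LCP stall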
SnB-family is much more efficient, just inserting an extra uop to do the merging without stalling while it does so. AMD and Intel Silvermont (and P4) don't rename partial registers separately at all, so they do have "false" dependencies on the previous contents. In this case, we're later reading the full register, so it's a true dependency because we want the merging, so those CPUs have an advantage. (Intel Haswell/Skylake (and maybe IvB) don't rename AL separately from RAX; they only rename AH/BH/CH/DH separately. And reading high8 registers has extra latency. See this Q&A about partial registers on HSW/SKL for the details.)
Neither of the partial-reg stalls is part of a long dependency chain, since the merged reg is overwritten in the next iteration. Apparently Core2 just stalls the front-end, or even the whole out-of-order execution core? I was meaning to ask a question about how expensive partial register slowdowns are on Core2, and how to measure the cost on SnB. @user786653's oprofile answer sheds some light on it. (And it also has some really helpful C reverse-engineered from the OP's asm to help make it clear what this function is really trying to accomplish.)
Compiling that C with a modern gcc can produce vectorized asm that does the loop 4 dwords at a time, in an xmm register. It does a much better job when it can use SSE4.1, though. (And clang doesn't auto-vectorize this at all with -march=core2, but it does unroll a lot, probably interleaving multiple iterations to avoid partial-register stuff.) If you don't tell gcc that dest is aligned, it generates a huge amount of scalar prologue/epilogue around the vectorized loop to reach a point where it is aligned.
It turns the integer args into vector constants (on the stack, since 32bit code only has 8 vector registers). The inner loop is:
.L4:
movdqa xmm0, XMMWORD PTR [esp+64]
mov ecx, edx
add edx, 1
sal ecx, 4
paddd xmm0, xmm3
paddd xmm3, XMMWORD PTR [esp+16]
psrld xmm0, 8
movdqa xmm1, xmm0
movdqa xmm0, XMMWORD PTR [esp+80]
pand xmm1, xmm7
paddd xmm0, xmm2
paddd xmm2, XMMWORD PTR [esp+32]
psrld xmm0, 16
pand xmm0, xmm6
por xmm0, xmm1
movdqa xmm1, XMMWORD PTR [esp+48]
paddd xmm1, xmm4
paddd xmm4, XMMWORD PTR [esp]
pand xmm1, xmm5
por xmm0, xmm1
movaps XMMWORD PTR [eax+ecx], xmm0
cmp ebp, edx
ja .L4
Notice that there's one store in the whole loop. All the loads are just vectors it calculated earlier, stored on the stack as locals.
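If you'd rather hand-write that with intrinsics than rely on the auto-vectorizer (as suggested above), a minimal SSE2 sketch could look like the following. This is my own illustration, not one of the tested versions: the name tri_sse2 is made up, and it assumes cnt is a multiple of 4 (the scalar tail loop is left out).
#include <emmintrin.h>   // SSE2
#include <stdint.h>
void tri_sse2(uint32_t* dest, uint32_t cnt, uint32_t cr, uint32_t cg, uint32_t cb,
              uint32_t dcr, uint32_t dcg, uint32_t dcb)
{
    // assumption: cnt % 4 == 0. Lane i starts at c + (i+1)*dc, because the scalar code increments before storing.
    __m128i vcr = _mm_setr_epi32(cr + dcr, cr + 2*dcr, cr + 3*dcr, cr + 4*dcr);
    __m128i vcg = _mm_setr_epi32(cg + dcg, cg + 2*dcg, cg + 3*dcg, cg + 4*dcg);
    __m128i vcb = _mm_setr_epi32(cb + dcb, cb + 2*dcb, cb + 3*dcb, cb + 4*dcb);
    const __m128i step_r = _mm_set1_epi32(4*dcr);
    const __m128i step_g = _mm_set1_epi32(4*dcg);
    const __m128i step_b = _mm_set1_epi32(4*dcb);
    for (uint32_t i = 0; i < cnt; i += 4) {          // 4 pixels per iteration
        __m128i r = _mm_and_si128(vcr, _mm_set1_epi32((int)0xffff0000));
        __m128i g = _mm_and_si128(_mm_srli_epi32(vcg, 8),  _mm_set1_epi32(0x0000ff00));
        __m128i b = _mm_and_si128(_mm_srli_epi32(vcb, 16), _mm_set1_epi32(0x000000ff));
        _mm_storeu_si128((__m128i*)(dest + i), _mm_or_si128(r, _mm_or_si128(g, b)));
        vcr = _mm_add_epi32(vcr, step_r);            // advance every lane by 4 steps
        vcg = _mm_add_epi32(vcg, step_g);
        vcb = _mm_add_epi32(vcb, step_b);
    }
}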
There are several ways to speed up the OP's code. Most obvious is that we don't need to make a stack frame, freeing up ebp. The most obvious use for it is to hold cr, which the OP spills to the stack. user786653's triAsm4 does this, except he uses the insane troll logic variation of it: he makes a stack frame and sets up ebp as usual, but then stashes esp in a static location and uses it as a scratch register!! This will obviously break horribly if your program has any signal handlers, but otherwise is fine (except for making debugging harder).
If you're going to go so crazy that you want to use esp as a scratch, copy the function args to static locations too, so you don't need a register to hold any pointers to stack memory. (Saving the old esp in an MMX register is also an option, so you can do this in re-entrant functions used from multiple threads at once. But not if you copy the args somewhere static, unless it's to thread-local storage with a segment override or something. You don't have to worry about re-entrancy from within the same thread, because the stack pointer is in an unusable state. Anything like a signal handler that could re-enter your function in the same thread will instead crash. >.<)
Spilling cr is actually not the most optimal choice: instead of using two registers for looping (counter and pointer), we can just keep a dst pointer in a register. Form the loop boundary by calculating an end pointer (one past the end: dst+4*cnt), and use a cmp with a memory operand as the loop condition.
Comparing against an end-pointer with cmp/jb is actually more optimal on Core2 than dec/jge anyway. Unsigned conditions can macro-fuse with cmp. Until SnB, only cmp and test can macro-fuse at all. (This is also true for AMD Bulldozer, but cmp and test can fuse with any jcc on AMD.) SnB-family CPUs can macro-fuse dec/jge. Interestingly, Core2 can only macro-fuse signed compares (like jge) with test, not cmp. (An unsigned compare is the correct choice for an address anyway, since 0x80000000 is not special, but 0 is. I didn't use jb just as a risky optimization.)
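As a quick side-by-side of the two loop bottoms (illustration only; dest_arg is the end-pointer spill slot used in the code below):
cmp edx, [dest_arg]   ; unsigned condition: cmp + jb can macro-fuse on Core2 (32-bit mode only)
jb  .loop
dec ecx               ; no macro-fusion on Core2 (only cmp/test fuse there), and dec leaves CF
jge .loop             ; unmodified, which can cause a false dependency on an older flag-writing insn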
We can't pre-shift cb and dcb down to the low byte, because they need to maintain more precision internally. However, we can left-shift the other two, so they're up against the left edge of their registers. Right-shifting them down to their destination position won't leave any garbage high bits from possible overflow.
Instead of merging into eax, we could do overlapping stores. Store 4B from eax, then store the low 2B from bx. That would save the partial-reg stall in eax, but generate one for merging bh into ebx, so that's of limited value. Possibly a 4B write and two overlapping 1B stores are actually good here, but that's starting to be a lot of stores. Still, it might be spread over enough other instructions to not bottleneck on the store port.
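With the OP's register layout, the overlapping-store idea would look roughly like this (sketch only):
; eax = cr (useful bits in the top half), bx = { bh = cg, bl = cb }
mov DWORD [edx], eax   ; 4-byte store; the low two bytes are still garbage from cr's fractional part
mov WORD  [edx], bx    ; overlapping 2-byte store rewrites the low half with cg:cb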
user786653's triAsm3 uses masking and or instructions for merging, which looks like a sensible approach for Core2. For AMD, Silvermont, or P4, using 8b and 16b mov instructions to merge partial registers is probably actually good. You can also take advantage of it on Ivybridge/Haswell/Skylake if you only write the low8 or low16 to avoid merging penalties. However, I came up with multiple improvements over that to require less masking.
; use defines you can put [] around so it's clear they're memory refs
; %define cr  ebp+0x10
%define cr  esp+something that depends on how much we pushed
%define dcr ebp+0x1c   ;; change these to work from ebp, too.
%define dcg ebp+0x20
%define dcb ebp+0x24

; esp-relative offsets may be wrong, just quickly did it in my head without testing:
; we push 3 more regs after ebp, which was the point at which ebp snapshots esp in the stack-frame version.
; So add 0xc (i.e. mentally add 0x10 and subtract 4)
; 32bit code is dumb anyway.  64bit passes args in regs.

%define dest_arg  esp+14
%define cnt_arg   esp+18
... everything else

tri_pjc:
    push    ebp
    push    edi
    push    esi
    push    ebx       ; only these 4 need to be preserved in the normal 32bit calling convention

    mov     ebp, [cr]
    mov     esi, [cg]
    mov     edi, [cb]

    shl     esi, 8    ; put the bits we want at the high edge, so we don't have to mask after shifting in zeros
    shl     [dcg], 8
    shl     edi, 8
    shl     [dcb], 8
    ; apparently the original code doesn't care if cr overflows into the top byte.

    mov     edx, [dest_arg]
    mov     ecx, [cnt_arg]
    lea     ecx, [edx + ecx*4]  ; one-past the end, to be used as a loop boundary
    mov     [dest_arg], ecx     ; spill it back to the stack, where we only need to read it.

ALIGN 16
.loop:  ; SEE BELOW, this inner loop can be even more optimized
    add     esi, [dcg]
    mov     eax, esi
    shr     eax, 24           ; eax bytes = { 0  0  0 cg }

    add     edi, [dcb]
    shld    eax, edi, 8       ; eax bytes = { 0  0 cg cb }

    add     ebp, [dcr]
    mov     ecx, ebp
    and     ecx, 0xffff0000
    or      eax, ecx          ; eax bytes = { x cr cg cb}  where x is overflow from cr.  Kill that by changing the mask to 0x00ff0000
    ; another shld to merge might be faster on other CPUs, but not core2
    ; merging with  mov cx, ax  would also be possible on CPUs where that's cheap (AMD, and Intel IvB and later)

    mov     DWORD [edx], eax
    ; alternatively:
    ; mov     DWORD [edx], ebp
    ; mov     WORD  [edx], eax   ; this insn replaces the mov/and/or merging

    add     edx, 4
    cmp     edx, [dest_arg]   ; core2 can macro-fuse cmp/unsigned condition, but not signed
    jb      .loop

    pop     ebx
    pop     esi
    pop     edi
    pop     ebp
    ret
I ended up with one more register than I needed, after doing the omit-frame-pointer and putting the loop-boundary in memory. You could either cache something extra in registers, or avoid saving/restoring a register. Maybe keeping the loop boundary in ebx is the best bet. It basically saves one prologue instruction. Keeping dcb or dcg in a register would require an extra insn in the prologue to load it. (The shifts with a memory destination are ugly and slow, even on Skylake, but small code size. They're not in the loop, and core2 doesn't have a uop cache. load/shift/store separately is still 3 uops, so you can't beat it unless you are going to keep it in a reg instead of storing.)
shld is a 2-uop insn on P6 (Core2). Luckily, it's easy to order the loop so it's the fifth instruction, preceded by four single-uop instructions. It should hit the decoders as the first uop in the 2nd group of 4, so it doesn't cause a delay in the frontend. (Core2 can decode 1-1-1-1, 2-1-1-1, 3-1-1-1, or 4-1-1-1 uops-per-insn patterns. SnB and later redesigned the decoders, and added a uop cache that makes decoding not usually the bottleneck, and can handle only groups of 1-1-1-1, 2-1-1, 3-1, and 4.)
shld is horrible on AMD K8, K10, Bulldozer-family, and Jaguar. 6 m-ops, 3c latency, and one per 3c throughput. It's great on Atom/Silvermont with 32bit operand-size, but horrible with 16 or 64b registers.
This insn ordering might decode with the cmp as the last insn of a group, and then jb by itself, making it not macro-fuse. This might give an extra advantage to the overlapping-stores method of merging, more than just saving a uop, if front-end effects are a factor for this loop. (And I suspect they would be, given the high degree of parallelism and that the loop-carried dep chains are short, so work for multiple iterations can be happening at once.)
So: fused-domain uops per iteration: 13 on Core2 (assuming macro-fusion which might not actually happen), 12 on SnB-family. So IvB should run this at one iteration per 3c (assuming none of the 3 ALU ports are a bottleneck. The mov r,r instructions don't need ALU ports, and neither does the store. add and booleans can use any port. shr and shld are the only ones that can't run on a wide choice of ports, and there are only two shifts per three cycles.) Core2 will take 4c per iteration to issue it even if it manages to avoid any frontend bottlenecks, and even longer to run it.
We might still be running fast enough on Core2 that spilling/reloading cr to the stack every iteration would be a bottleneck if we were still doing that. It adds a memory round-trip (5c) to a loop-carried dependency chain, making a total dep chain length of 6 cycles (including the add).
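That chain would be (sketch only, using the [] convention from the surrounding code and the OP's spill slot):
mov  eax, [cr]    ; reload: ~5c store-forwarding latency from the previous iteration's store
add  eax, [dcr]   ; +1c
mov  [cr], eax    ; feeds the next iteration's reload: roughly a 6-cycle loop-carried dependency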
Hmm, actually even Core2 might win from using two shld insns to merge. It also saves another register!
ALIGN 16
;mov ebx, 111       ; IACA start
;db 0x64, 0x67, 0x90
.loop:
    add     ebp, [dcr]
    mov     eax, ebp
    shr     eax, 16           ; eax bytes = { 0  0  x cr}  where x is overflow from cr. Kill that by pre-shifting cr and dcr like the others, and use shr 24 here

    add     esi, [dcg]
    shld    eax, esi, 8       ; eax bytes = { 0  x cr cg}
    add     edx, 4            ; this goes between the `shld`s to help with decoder throughput on pre-SnB, and to not break macro-fusion.

    add     edi, [dcb]
    shld    eax, edi, 8       ; eax bytes = { x cr cg cb}
    mov     DWORD [edx-4], eax

    cmp     edx, ebx          ; use our spare register here
    jb      .loop             ; core2 can macro-fuse cmp/unsigned condition, but not signed.  Macro-fusion works in 32-bit mode only on Core2.
;mov ebx, 222       ; IACA end
;db 0x64, 0x67, 0x90
Per-iteration: SnB: 10 fused-domain uops. Core2: 12 fused-domain uops, so this is shorter than the previous version on Intel CPUs (but horrible on AMD). Using shld saves mov instructions because we can use it to non-destructively extract the high byte of the source.
Core2 can issue the loop at one iteration per 3 clocks. (It was Intel's first CPU with a 4 uop wide pipeline).
From Agner Fog's table for Merom/Conroe (first gen Core2) (note that David Kanter's block diagram has p2 and p5 reversed):
shr: runs on p0/p5
shld: 2 uops for p0/p1/p5? Agner's table for pre-Haswell doesn't say which uops can go where.
mov r,r, add, and: p0/p1/p5
According to IACA, which has a mode for Nehalem but not Core2, most of the shld uops go to p1, with only less than 0.6 on average from each insn running on other ports. Nehalem has essentially the same execution units as Core2. All the instructions involved here have the same uop costs and port requirements on NHM and Core2. IACA's analysis looks good to me, and I don't want to check everything on my own for this answer to a 5 year old question. It was fun answering, though. :)
Anyway, according to IACA, uops should distribute well between ports. It figures Nehalem can run the loop at one iteration per 3.7 cycles, saturating all three execution ports. Its analysis looks good to me. (Note that I had to drop the memory operand from the cmp to make IACA not give stupid results.) That's clearly needed anyway, since pre-SnB can only do one load per cycle: we'd bottleneck on port2 with four loads in the loop.
IACA doesn't agree with Agner Fog's testing for IvB and SnB (it thinks shld is still 2 uops, when it's actually one, according to my testing on SnB). So its numbers are silly.
IACA looks correct for Haswell, where it says the bottleneck is the frontend. It thinks HSW can run it at one per 2.5c. (The loop buffer in Haswell at least can issue loops in a non-integer number of cycles per iteration. Sandybridge may be limited to whole numbers of cycles, where the taken loop-branch ends an issue-group.)
I also found I needed to use iaca.sh -no_interiteration, or else it would think there was an interiteration loop-carried dependency and think the loop would take 12c on NHM.