How can I improve the compiler's handling of my SSE intrinsics?
Having read this interesting article on intrinsic-guided optimization of SSE code in different C++ compilers, I decided to run a test of my own, especially since the post is a few years old. I used MSVC, which did very poorly in the tests performed by the author of the post (admittedly that was the VS 2010 version), and stuck to a very basic scenario: packing some values into an XMM register and doing a simple operation like addition. In the article, _mm_set_ps translated into a weird sequence of scalar move and unpack instructions, so let's see:
#include <iostream>
#include <tchar.h>
#include <xmmintrin.h>

using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
    __m128 foo = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 bar = _mm_set_ps(5.0f, 6.0f, 7.0f, 8.0f);
    __m128 ret = _mm_add_ps(foo, bar);

    // Need to read the result so the variables aren't optimized out in Release.
    float *f = (float *)(&ret);
    for (int i = 0; i < 4; i++)
    {
        cout << "f[" << i << "] = " << f[i] << endl;
    }
    return 0;
}
Next, I compiled and ran this inside the debugger, looking at the disassembly:
Debug:
__m128 foo = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
00B814F0 movaps xmm0,xmmword ptr ds:[0B87840h]
00B814F7 movaps xmmword ptr [ebp-190h],xmm0
00B814FE movaps xmm0,xmmword ptr [ebp-190h]
00B81505 movaps xmmword ptr [foo],xmm0
__m128 bar = _mm_set_ps(5.0f, 6.0f, 7.0f, 8.0f);
00B81509 movaps xmm0,xmmword ptr ds:[0B87850h]
00B81510 movaps xmmword ptr [ebp-170h],xmm0
00B81517 movaps xmm0,xmmword ptr [ebp-170h]
00B8151E movaps xmmword ptr [bar],xmm0
__m128 ret = _mm_add_ps(foo, bar);
00B81522 movaps xmm0,xmmword ptr [bar]
00B81526 movaps xmm1,xmmword ptr [foo]
00B8152A addps xmm1,xmm0
00B8152D movaps xmmword ptr [ebp-150h],xmm1
00B81534 movaps xmm0,xmmword ptr [ebp-150h]
00B8153B movaps xmmword ptr [ret],xmm0
I'm utterly confused: why does putting an xmmword into a __m128 require four MOVAPS instructions? First, the data is loaded into xmm0 (I assume from the literal for the four float values stored somewhere in the data segment; I'm not sure how to inspect it), then xmm0 is copied to a stack location addressed via ebp plus an offset, only to be copied from there right back into xmm0, and finally to the location of the variable that's supposed to hold it. Why so much work?
Release: This time I expected the compiler to avoid storing the xmmwords in memory entirely: just put one in xmm0, the other in xmm1, do an ADDPS, put the result in memory and be done with it. Instead I got:
__m128 foo = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
__m128 bar = _mm_set_ps(5.0f, 6.0f, 7.0f, 8.0f);
__m128 ret = _mm_add_ps(foo, bar);
003E1009 movaps xmm0,xmmword ptr ds:[3E2130h]
003E1010 push esi
003E1011 movaps xmmword ptr [esp+10h],xmm0
Apparently there's no need for an ADDPS. I'm guessing the compiler noticed the two xmmwords were compile-time constants, so it added them itself and embedded the result in the binary as a literal? The odd push probably has to do with the for-loop that follows, since esi is used as the loop counter there as far as I can tell. Still, why load the precalculated literal from the data segment into xmm0 and then store it into a local variable (esp+10h), instead of just using the literal directly?
To sum up, the Debug version was more naive than I expected (or maybe I'm missing something), while the Release version was unexpectedly clever. Any comments explaining this behaviour would be greatly appreciated. Thanks.
EDIT: The answers were very enlightening, but I would still like to know whether there's anything I can do to improve the compiler's output, which is why I'm changing the question from asking for an explanation to its current form.
For example, would it be possible to somehow guide the compiler to not store foo and bar in memory (since I don't need them after the addition), and instead just load them into XMM registers and keep them there? Possibly ret too? The author of the cited article said MSVC was just "doing exactly what it was told to". Is there any way to get better code (read: code that avoids memory transfers) without explicitly writing an __asm block? Thanks.
This is just a normal side effect of the way the code generator works. _mm_set_ps() has two distinct jobs to do. First, it must build up the __m128 value from the four arguments. You picked the easy way; it gets a lot more convoluted with:
float x = 1.0f;
__m128 foo = _mm_set_ps(x, 2.0f, 3.0f, 4.0f);
With drastically different codegen:
00C513DD movss xmm0,dword ptr ds:[0C5585Ch]
00C513E5 movss xmm1,dword ptr [x]
00C513EA movaps xmm2,xmmword ptr ds:[0C55860h]
00C513F1 unpcklps xmm0,xmm1
00C513F4 unpcklps xmm2,xmm0
00C513F7 movaps xmmword ptr [ebp-100h],xmm2
The second job is then to move it into the __m128 variable, and that part is easy:
00C513FE movaps xmm0,xmmword ptr [ebp-100h]
00C51405 movaps xmmword ptr [foo],xmm0
That this isn't optimized yet is simply because the optimizer is turned off in the Debug build. The code generator doesn't make any attempt to optimize, that's just not its job.
And sure, the optimizer was capable of computing the result at compile time. That even works for the convoluted example; you've already seen this:
00EE1284 movaps xmm0,xmmword ptr ds:[0EE3260h]
This is really a question about MSVC internals; for a definitive answer, you'd have to ask Microsoft.
One might speculate that the reason the Release build puts ret into a local variable is that you have taken its address. Taking the address of a variable means the compiler suddenly has to deal with memory rather than registers, and memory is much harder for a compiler, because other places in the program may hold pointers to it that the optimizer must account for.
You're right about the compile-time optimization in the Release build (look up ds:[3E2130h] in your object file and you will find the added values there).
Yes, the debug version seems to do unnecessary work, but only by a factor of 2, not a factor of 4. One would actually expect
movaps xmmword ptr [foo],xmmword ptr ds:[0B87840h]
to exist, but it doesn't: MOVAPS comes in two variants, and neither allows moving from memory to memory (the usual situation on x86):
MOVAPS xmm1,xmm2/mem128 ; 0F 28 /r [KATMAI,SSE]
MOVAPS xmm1/mem128,xmm2 ; 0F 29 /r [KATMAI,SSE]
What the debug assembly does is read the xmmword from ds:[0B87840h] in the .data section of your object file (which is most likely read-only) and put it on the stack at [ebp-190h] as well as in foo.
For comparison, gcc 4.7 exhibits a similar pattern:
movaps xmm0, XMMWORD PTR .LC0[rip] # D.5374,
movaps XMMWORD PTR [rbp-64], xmm0 # foo, D.5353
movaps xmm0, XMMWORD PTR .LC1[rip] # D.5381,
movaps XMMWORD PTR [rbp-48], xmm0 # bar, D.5354
movaps xmm0, XMMWORD PTR [rbp-64] # tmp79, foo
movaps XMMWORD PTR [rbp-32], xmm0 # __A, tmp79
movaps xmm0, XMMWORD PTR [rbp-48] # tmp80, bar
movaps XMMWORD PTR [rbp-16], xmm0 # __B, tmp80
movaps xmm0, XMMWORD PTR [rbp-16] # tmp81, __B
movaps xmm1, XMMWORD PTR [rbp-32] # tmp82, __A
addps xmm0, xmm1 # D.5386, tmp82
I would assume this has to do with the way the builtin intrinsics are implemented. For example, _mm_add_ps works with __m128 arguments that may be in registers, on the stack or somewhere else at the time it is called. Therefore, if you're writing the intrinsics support for gcc/VC++, you first have to generate code that loads the values. When the optimizer runs, it immediately notices the unnecessary pushing around of data (but the optimizer does not run in debug builds).