SSE on x86, stack alignment

The SSE code I have was written for x64, where the stack is 16-byte aligned. The optimised code paths have now been requested on x86 as well (for MSVC/Windows and GCC/Linux), and I am getting them working on MSVC first.

Apart from some inline functions that took more than three __m128 parameters, which MSVC refused to compile (fixed by passing the extra parameters by const reference and hoping the compiler optimises that away), everything seems to work as is.

//error C2719: 'd': formal parameter with __declspec(align('16')) won't be aligned
inline __m128i foo(__m128i a, __m128i b, __m128i c, __m128i d) {...}
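A minimal sketch of the const-reference workaround described above, assuming SSE2 is available through <emmintrin.h>. The names add4 and add4_demo are hypothetical and only there for illustration; the real functions do different work.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics: __m128i, _mm_add_epi32, ...
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for foo(): the first three vectors are still passed
// by value, the fourth is passed by const reference so that 32-bit MSVC does
// not reject it with C2719.
static inline __m128i add4(__m128i a, __m128i b, __m128i c, const __m128i& d)
{
    return _mm_add_epi32(_mm_add_epi32(a, b), _mm_add_epi32(c, d));
}

// Hypothetical demo: sums four splatted vectors and returns the lowest lane.
static inline std::int32_t add4_demo()
{
    const __m128i a = _mm_set1_epi32(1);
    const __m128i b = _mm_set1_epi32(2);
    const __m128i c = _mm_set1_epi32(3);
    const __m128i d = _mm_set1_epi32(4);

    // _mm_cvtsi128_si32 extracts the lowest 32-bit lane of the result.
    const std::int32_t low = _mm_cvtsi128_si32(add4(a, b, c, d));
    assert(low == 1 + 2 + 3 + 4);
    return low;
}
```

Whether the reference really gets optimised away once the function is inlined is exactly what I am hoping for, not something this sketch demonstrates.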

However, I was under the impression that the stack is not 16-byte aligned on x86. Yet some __declspec(align(16)) arrays on the stack didn't even get a warning, and I am sure the compiler must be pushing and popping (spilling) the __m128 values (I recall working out that 12 registers were required on x64, and even then it moved some values it didn't need for a while to the stack and did its own thing anyway).

I even added some asserts on the array memory addresses (and made sure NDEBUG was not defined), and they all pass.

__declspec(align(16)) uint32_t blocks[64];
assert(((uintptr_t)blocks) % 16 == 0);

__m128i a = ...;
__m128i b = ...;
__m128i c = ...;
__m128i d = ...;
__m128i e = ...;
__m128i f = ...;
__m128i g = ...;
// do other stuff, which surely means there are not enough registers on x86
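Since __declspec(align(16)) is MSVC-specific, the same check for the GCC/Linux build might look something like the sketch below. It assumes a C++11 compiler so that alignas is available; is_aligned_16 and stack_block_is_aligned are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if p sits on a 16-byte boundary.
static inline bool is_aligned_16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) % 16) == 0;
}

// Mirrors the stack array from the snippet above, using the standard C++11
// alignas specifier instead of __declspec(align(16)).
static inline bool stack_block_is_aligned()
{
    alignas(16) std::uint32_t blocks[64] = {};
    const bool ok = is_aligned_16(blocks);
    assert(ok);  // same check as the original assert on the array address
    return ok;
}
```

A passing assert only shows that this particular frame came out aligned; it does not tell me whether that is guaranteed.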

Did I just get really lucky, or is there some magic going on here to realign the stack? And is this portable? I am sure I recall having issues getting some D3DX types to align on x86 when I was doing D3D9 with VS2008.

One thing I did get a bunch of warnings for, however, was the __m128 -> __m128& conversions being non-standard. Is this really unsupported on some compiler that does support SSE, and how is one meant to avoid it (e.g. inlines with __m128 output parameters, or with more than three parameters)?

Also, a quick look suggests that MS themselves somehow break these rules (e.g. XMMatrixTransformation http://msdn.microsoft.com/en-us/library/windows/desktop/microsoft.directx_sdk.matrix.xmmatrixtransformation%28v=vs.85%29.aspx takes 6 SSE objects, the only difference I can see being that they're wrapped in structs):

XMMATRIX XMMatrixTransformation(
  [in]  XMVECTOR ScalingOrigin,
  [in]  XMVECTOR ScalingOrientationQuaternion,
  [in]  XMVECTOR Scaling,
  [in]  XMVECTOR RotationOrigin,
  [in]  XMVECTOR RotationQuaternion,
  [in]  XMVECTOR Translation
);