SSE on x86, stack alignment
The SSE code I have was written for x64, where the stack is 16-byte aligned. The optimised code paths have now been requested on x86 as well (for MSVC/Windows and GCC/Linux). I am getting this working on MSVC first.
Apart from some inline functions that took more than 3 __m128 parameters, which MSVC refused to compile (fixed by passing the extra parameters as const references and hoping the compiler will optimise them away), everything seems to work as is.
//error C2719: 'd': formal parameter with __declspec(align('16')) won't be aligned
inline __m128i foo(__m128i a, __m128i b, __m128i c, __m128i d) {...}
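To make the workaround concrete, here is a minimal sketch of the const-reference version. The name `sum4` and its body are hypothetical stand-ins for my real function; only the parameter list mirrors `foo`. It assumes SSE2 is available via `<emmintrin.h>`.

```cpp
#include <emmintrin.h>  // SSE2 integer intrinsics (__m128i)
#include <cstdint>

// Hypothetical stand-in for foo(): the first three vectors are passed by
// value as before, the fourth is passed by const reference so that MSVC x86
// no longer has to place an aligned by-value argument on the stack.
inline __m128i sum4(__m128i a, __m128i b, __m128i c, const __m128i& d) {
    return _mm_add_epi32(_mm_add_epi32(a, b), _mm_add_epi32(c, d));
}

// Hypothetical helper for checking the result: broadcasts four scalars,
// calls sum4, and returns the lowest 32-bit lane.
inline std::int32_t sum4_lane0(std::int32_t a, std::int32_t b,
                               std::int32_t c, std::int32_t d) {
    const __m128i r = sum4(_mm_set1_epi32(a), _mm_set1_epi32(b),
                           _mm_set1_epi32(c), _mm_set1_epi32(d));
    return _mm_cvtsi128_si32(r);
}
```

Since the function is inline, I would expect the reference to disappear after inlining, but that is a hope rather than something I have verified in the generated code.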
However, I was under the impression that the stack is not 16-byte aligned on x86. Yet some __declspec(align(16)) arrays on the stack didn't even produce a warning, and I am sure the compiler must be pushing and popping the __m128 values (I recall working out that 12 registers were required on x64, and even then it moved some values it didn't need for a while to the stack and did its own thing anyway).
I even added some asserts on the array memory addresses (and turned off NDEBUG) and they all seem to pass.
__declspec(align(16)) uint32_t blocks[64];
assert(((uintptr_t)blocks) % 16 == 0);
__m128i a = ...;
__m128i b = ...;
__m128i c = ...;
__m128i d = ...;
__m128i e = ...;
__m128i f = ...;
__m128i g = ...;
//do other stuff, which surely means there are not enough registers on x86
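For the GCC/Linux build, the same check can be written without the MSVC-specific `__declspec`. The sketch below assumes a C++11 compiler that supports `alignas`; the helper names `is_aligned` and `stack_array_is_aligned` are made up for illustration. It only tests the address of the array and says nothing about how the compiler achieves the alignment.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if p is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::uintptr_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Same idea as the __declspec(align(16)) array above, spelled with the
// standard C++11 alignas specifier instead.
inline bool stack_array_is_aligned() {
    alignas(16) std::uint32_t blocks[64] = {};
    return is_aligned(blocks, 16);
}

inline void check_stack_alignment() {
    // Fires only if the compiler failed to honour alignas on the stack
    // (and NDEBUG is not defined).
    assert(stack_array_is_aligned());
}
```

Calling `check_stack_alignment()` from a few different call depths would at least show whether the passing asserts depend on where the function happens to be called from.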
Did I just get really lucky or is there some magic going on here to realign the stack? And is this portable? I am sure I recall having issues getting some D3DX stuff to align on x86 when I was doing D3D9 back with VS2008.
One thing I did get a bunch of warnings for, however, was the __m128 -> __m128& conversions being non-standard. Is this really unsupported on some compiler that does support SSE, and how is one meant to avoid it (e.g. inline functions with __m128 outputs, or with more than 3 parameters)?
Also, a quick look suggests that MS themselves somehow break these rules (e.g. XMMatrixTransformation http://msdn.microsoft.com/en-us/library/windows/desktop/microsoft.directx_sdk.matrix.xmmatrixtransformation%28v=vs.85%29.aspx takes 6 SSE objects, the only difference I can see being that they're wrapped in structs).
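One shape I am considering for functions with several outputs is sketched below. It assumes, without my having confirmed it, that the warning comes from binding a temporary __m128 to a non-const reference. The struct `SumDiff` and the function `sum_diff` are hypothetical names: inputs are taken by const reference, and the outputs are returned in a small struct instead of through __m128& parameters.

```cpp
#include <xmmintrin.h>  // SSE float intrinsics (__m128)

// Hypothetical result type bundling two output vectors.
struct SumDiff {
    __m128 sum;
    __m128 diff;
};

// Inputs by const reference (temporaries may bind to these in standard C++),
// outputs by return value, so no non-const __m128& parameter is needed.
inline SumDiff sum_diff(const __m128& a, const __m128& b) {
    SumDiff r;
    r.sum = _mm_add_ps(a, b);
    r.diff = _mm_sub_ps(a, b);
    return r;
}
```

A call such as `SumDiff r = sum_diff(_mm_set1_ps(3.0f), _mm_set1_ps(1.0f));` passes temporaries directly. Whether this generates code as good as the output-reference version after inlining is something I would still need to check.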
XMMATRIX XMMatrixTransformation(
[in] XMVECTOR ScalingOrigin,
[in] XMVECTOR ScalingOrientationQuaternion,
[in] XMVECTOR Scaling,
[in] XMVECTOR RotationOrigin,
[in] XMVECTOR RotationQuaternion,
[in] XMVECTOR Translation
);