Why does System V / AMD64 ABI mandate a 16 byte stack alignment?
I've read in different places that it is done for "performance reasons", but I still wonder in which particular cases performance is improved by this 16-byte alignment, or, in any case, why it was chosen.
Edit: I think I wrote the question in a misleading way. I wasn't asking why the processor does things faster with 16-byte aligned memory; that is explained everywhere in the docs. What I wanted to know instead is how the enforced 16-byte alignment is better than just letting programmers align the stack themselves when needed. I'm asking this because, in my experience with assembly, the enforced stack alignment has two problems: it is only useful for less than 1% of the code that is executed (so for the other 99% it is pure overhead), and it is also a very common source of bugs. So I wonder how it really pays off in the end. While I'm still in doubt about this, I'm accepting Peter's answer as it contains the most detailed answer to my original question.
SSE2 is baseline for x86-64, and making the ABI efficient for types like __m128, and for compiler auto-vectorization, was one of the design goals, I think. The ABI has to define how such args are passed as function args, or by reference.
16-byte alignment is sometimes useful for local variables on the stack (especially arrays), and guaranteeing 16-byte alignment means compilers can get it for free whenever it's useful, even if the source doesn't explicitly request it.
If the stack alignment relative to a 16-byte boundary wasn't known, every function that wanted an aligned local would need an and rsp, -16, and extra instructions to save/restore rsp after an unknown offset to rsp (either 0 or -8), e.g. using up rbp for a frame pointer.
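As a rough illustration of what that and rsp, -16 computes, here is a minimal C sketch on plain integers. The function names are made up for this example; a real compiler does this on rsp itself, not on a uintptr_t.

```c
#include <assert.h>
#include <stdint.h>

/* Round an address down to a multiple of 16, as `and rsp, -16` would.
   (-16 in two's complement has the same bit pattern as ~15.) */
uintptr_t align_down_16(uintptr_t addr)
{
    uintptr_t aligned = addr & ~(uintptr_t)15;
    assert((aligned & 15) == 0 && aligned <= addr);
    return aligned;
}

/* How far the address moved. When the incoming alignment is unknown,
   this offset is only known at run time, so the function has to keep
   the old stack pointer somewhere (e.g. in rbp) to undo it. */
uintptr_t realign_offset(uintptr_t addr)
{
    return addr - align_down_16(addr);
}
```

With a stack that is only known to be 8-byte aligned, realign_offset() is 0 or 8 depending on the caller. With the ABI guarantee it is a compile-time constant, so no realignment code is needed.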
Without AVX, memory source operands have to be 16-byte aligned, e.g. paddd xmm0, [rsp+rdi] faults if the memory operand is misaligned. So if alignment isn't known, you'd have to either use movups xmm1, [rsp+rdi] / paddd xmm0, xmm1, or write a loop prologue / epilogue to handle the misaligned elements. For local arrays that the compiler wants to auto-vectorize over, it can simply choose to align them by 16.
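The two options can be sketched with SSE2 intrinsics in C. This is only an illustration: the function names are invented here, the integer intrinsics normally compile to movdqa / movdqu rather than movaps / movups, and the scalar fallback is there only so the sketch builds on targets without SSE2.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* acc[i] += src[i] for 4 ints. Both pointers MUST be 16-byte aligned:
   the aligned load is the kind a compiler can fold into paddd as a
   memory operand, and it faults on a misaligned address. */
void add4_aligned(int32_t *acc, const int32_t *src)
{
    assert(((uintptr_t)acc & 15) == 0 && ((uintptr_t)src & 15) == 0);
#if defined(__SSE2__)
    __m128i a = _mm_load_si128((const __m128i *)acc);
    __m128i b = _mm_load_si128((const __m128i *)src);
    _mm_store_si128((__m128i *)acc, _mm_add_epi32(a, b));
#else
    for (int i = 0; i < 4; i++) acc[i] += src[i];
#endif
}

/* Same operation when alignment is unknown: a separate unaligned load
   into a register first, then the add on registers. */
void add4_unaligned(int32_t *acc, const void *src)
{
#if defined(__SSE2__)
    __m128i a = _mm_loadu_si128((const __m128i *)acc);
    __m128i b = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)acc, _mm_add_epi32(a, b));
#else
    int32_t tmp[4];
    memcpy(tmp, src, sizeof tmp);
    for (int i = 0; i < 4; i++) acc[i] += tmp[i];
#endif
}
```

Calling add4_aligned() with a pointer that isn't 16-byte aligned is the case that faults (or trips the assert in a debug build), which is exactly why the compiler needs a guarantee rather than a likelihood.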
Also note that older x86 CPUs (before Nehalem / Bulldozer) had a movups instruction that's slower than movaps even when the pointer does turn out to be aligned (i.e. unaligned load/store instructions were extra slow even on aligned data, as well as preventing folding loads into an ALU instruction). (See Agner Fog's optimization guides, microarch guide, and instruction tables for more about all of the above.)
These factors are why a guarantee is more useful than just "usually" keeping the stack aligned. Being allowed to make code which actually faults on a misaligned stack allows more optimization opportunities.
Aligned arrays also speed up vectorized memcpy / strcmp / whatever functions that can't assume alignment, but instead check for it and can jump straight to their whole-vector loops.
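The "check for alignment, then go to the whole-vector loop" pattern looks roughly like this in C. It is a sketch with made-up names, and the inner 16-byte loop is plain scalar code standing in for a real vector loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes to handle one at a time before p reaches a 16-byte boundary.
   Returns 0 when p is already aligned. */
size_t head_bytes(const void *p)
{
    return (size_t)((16 - ((uintptr_t)p & 15)) & 15);
}

/* Sum n bytes: scalar prologue, aligned 16-byte blocks, scalar epilogue. */
uint64_t sum_bytes(const unsigned char *p, size_t n)
{
    uint64_t total = 0;
    size_t head = head_bytes(p);
    if (head > n) head = n;
    for (size_t i = 0; i < head; i++) total += p[i];   /* prologue */
    p += head;
    n -= head;

    while (n >= 16) {                                  /* whole-vector loop */
        assert(((uintptr_t)p & 15) == 0);
        for (int i = 0; i < 16; i++) total += p[i];
        p += 16;
        n -= 16;
    }
    for (size_t i = 0; i < n; i++) total += p[i];      /* epilogue */
    return total;
}
```

When the caller's array is already 16-byte aligned, head_bytes() is 0 and the prologue loop does nothing, so execution goes straight to the block loop.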
From a recent version of the x86-64 System V ABI (r252):
An array uses the same alignment as its elements, except that a local or global array variable of length at least 16 bytes or a C99 variable-length array variable always has alignment of at least 16 bytes. [4]
[4] The alignment requirement allows the use of SSE instructions when operating on the array. The compiler cannot in general calculate the size of a variable-length array (VLA), but it is expected that most VLAs will require at least 16 bytes, so it is logical to mandate that VLAs have at least a 16-byte alignment.
This is a bit aggressive, and mostly only helps when functions that auto-vectorize can be inlined, but usually there are other locals the compiler can stuff into any gaps so it doesn't waste stack space. It also doesn't waste instructions, as long as there's a known stack alignment. (Obviously the ABI designers could have left this out if they'd decided not to require 16-byte stack alignment.)
Spill/reload of __m128
Of course, it makes it free to do alignas(16) char buf[1024]; or other cases where the source requests 16-byte alignment.
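A small C11 sketch of that case, with a hypothetical function name. The point is only that the request is honoured for a local; on x86-64 System V the compiler can satisfy it by picking a suitable offset from rsp instead of realigning the stack at run time.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <string.h>

/* Returns the low 4 bits of the address of an alignas(16) local buffer,
   which should always be 0. */
unsigned local_buf_misalignment(void)
{
    alignas(16) char buf[1024];
    memset(buf, 0, sizeof buf);               /* make sure buf really lives in memory */
    assert(((uintptr_t)buf & 15) == 0);       /* the alignas request is honoured */
    return (unsigned)((uintptr_t)buf & 15);
}
```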
And there are also __m128 / __m128d / __m128i locals. The compiler may not be able to keep all vector locals in registers (e.g. spilled across a function call, or not enough registers), so it needs to be able to spill/reload them with movaps, or as a memory source operand for ALU instructions, for efficiency reasons discussed above.
Loads/stores that actually are split across a cache-line boundary (64 bytes) have significant latency penalties, and also minor throughput penalties on modern CPUs. The load needs data from 2 separate cache lines, so it takes two accesses to the cache. (And potentially 2 cache misses, but that's rare for stack memory).
I think movups already had that cost baked in for vectors on older CPUs where it's expensive, but it still sucks. Spanning a 4k page boundary is much worse (on CPUs before Skylake), with a load or store taking ~100 cycles if it touches bytes on both sides of a 4k boundary. (It also needs 2 TLB checks.) Natural alignment makes splits across any wider boundary impossible, so 16-byte alignment was sufficient for everything you can do with SSE2.
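The claim about natural alignment is easy to check with a few lines of C. This is an illustrative helper (the name is made up) that works on plain address arithmetic, not on real memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if an access of `size` bytes starting at `addr` touches bytes on
   both sides of a multiple of `boundary` (a power of 2 such as 64 for a
   cache line or 4096 for a page). */
bool crosses_boundary(uintptr_t addr, uintptr_t size, uintptr_t boundary)
{
    assert(size > 0 && boundary > 0 && (boundary & (boundary - 1)) == 0);
    return (addr / boundary) != ((addr + size - 1) / boundary);
}
```

For example, a 16-byte access at offset 56 covers bytes 56..71 and so splits across the 64-byte line boundary, while a 16-byte access at any multiple of 16 always ends at or before the next multiple of 64 or 4096.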
max_align_t has 16-byte alignment in the x86-64 System V ABI, because of long double (10-byte/80-bit x87). It's defined as padded to 16 bytes for some weird reason, unlike in 32-bit i386 System V code where sizeof(long double) == 12 with only 4-byte alignment. x87 10-byte load/store is quite slow anyway (like 1/3rd the load throughput of double or float on Core2, 1/6th on P4, or 1/8th on K8), but maybe cache-line and page split penalties were so bad on older CPUs that they decided to define it that way. I think on modern CPUs (maybe even Core2) looping over an array of long double would be no slower with packed 10-byte, because the fld m80 would be a bigger bottleneck than a cache-line split every ~6.4 elements.
Actually, the ABI was defined before silicon was available to benchmark on (back in ~2000), but those K8 numbers are the same as K7 (32-bit / 64-bit mode is irrelevant here). Making long double 16-byte does make it possible to copy a single one with movaps, even though you can't do anything with it in XMM registers. (Except manipulate the sign bit with xorps / andps / orps.)
Related: this max_align_t definition means that malloc always returns 16-byte aligned memory in x86-64 code. This lets you get away with using it for SSE aligned loads like _mm_load_ps, but such code can break when compiled for 32-bit where alignof(max_align_t) is only 8. (Use aligned_alloc or whatever.)
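A minimal sketch of the portable alternative, assuming a C11 library that provides aligned_alloc (glibc does; some platforms don't). The helper name is hypothetical and overflow of n * sizeof(float) is not handled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate room for n floats with 16-byte alignment, without relying on
   what malloc happens to guarantee on the current ABI.
   Release the result with free(). */
float *alloc_floats_16(size_t n)
{
    size_t bytes = n * sizeof(float);
    bytes = (bytes + 15) & ~(size_t)15;   /* C11 asks for a multiple of the alignment */
    if (bytes == 0) bytes = 16;
    float *p = aligned_alloc(16, bytes);
    assert(p == NULL || ((uintptr_t)p & 15) == 0);
    return p;
}
```

A pointer obtained this way is safe to pass to aligned loads such as _mm_load_ps in both 32-bit and 64-bit builds, as long as the indexing keeps each vector on a 16-byte boundary.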
Other ABI factors include passing __m128 values on the stack (after xmm0-7 have the first 8 float / vector args). It makes sense to require 16-byte alignment for vectors in memory, so they can be used efficiently by the callee, and stored efficiently by the caller. Maintaining 16-byte stack alignment at all times makes it easy for functions that need to align some arg-passing space by 16.
There are types like __m128 that the ABI guarantees have 16-byte alignment. If you define a local and take its address, and pass that pointer to some other function, that local needs to be sufficiently aligned. So maintaining 16-byte stack alignment goes hand in hand with giving some types 16-byte alignment, which is obviously a good idea.
These days, it's nice that atomic<struct_of_16_bytes> can cheaply get 16-byte alignment, so lock cmpxchg16b doesn't ever cross a cache line boundary. For the really rare case where you have an atomic local with automatic storage, and you pass pointers to it to multiple threads...