Modern x86 cost model

I'm writing a JIT compiler with an x86 backend and learning x86 assembler and machine code as I go. I used ARM assembler about 20 years ago and am surprised by the difference in cost models between these architectures.

Specifically, memory accesses and branches are expensive on ARM but the equivalent stack operations and jumps are cheap on x86. I believe modern x86 CPUs do far more dynamic optimizations than ARM cores do and I find it difficult to anticipate their effects.

What is a good cost model to bear in mind when writing x86 assembler? Which combinations of instructions are cheap and which are expensive?

For example, my compiler would be simpler if it always generated the long form for loading integers or jumping to offsets, even when the integers were small or the offsets close, but would this impact performance?
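
For concreteness, here is a minimal sketch in NASM syntax of the two choices (assumptions: 64-bit code; the byte counts follow the standard Intel encodings). Broadly speaking, the long forms are not slower to execute in themselves; their main cost is code size, which is paid indirectly through instruction-cache footprint and decode bandwidth.

    bits 64

            ; Loading a small integer: three encodings of "put 1 in rax".
            mov     rax, 1                  ; 48 C7 C0 01 00 00 00 (7 bytes, sign-extended imm32)
            mov     eax, 1                  ; B8 01 00 00 00 (5 bytes; writing eax zero-extends into rax)
            mov     rax, strict qword 1     ; 48 B8 + 8 bytes (10 bytes; NASM's strict keyword forces the full imm64 "long form")

            ; Jumping to a nearby target: short vs near encodings.
            jmp     short done              ; EB xx (2 bytes, rel8)
            jmp     near done               ; E9 xx xx xx xx (5 bytes, rel32 -- the "long form")
    done:

For a JIT, always emitting the long forms is therefore a reasonable simplification: the generated code is merely larger, which tends to matter only once hot code stops fitting in the caches.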

I haven't done any floating point yet, but I'd like to get on to it soon. Is there anything non-obvious about the interaction between integer and floating-point code?

I know there are lots of references (e.g. Michael Abrash) on x86 optimization, but I have a hunch that anything more than a few years old will not apply to modern x86 CPUs because they have changed so much lately. Am I correct?


The best reference is the Intel Optimization Manual, which provides fairly detailed information on architectural hazards and instruction latencies for all recent Intel cores, as well as a good number of optimization examples.

Another excellent reference is Agner Fog's optimization resources, which have the virtue of also covering AMD cores.

Note that specific cost models are, by nature, micro-architecture specific. There's no such thing as an "x86 cost model" that has any kind of real validity. At the instruction level, the performance characteristics of Atom are wildly different from those of an i7.

I would also note that memory accesses and branches are not actually "cheap" on x86 cores -- it's just that the out-of-order execution model has become so sophisticated that it can successfully hide the cost of them in many simple scenarios.
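
To make "hiding the cost" concrete, here is a hedged NASM sketch (assuming 64-bit code and that rsi/rax hold valid pointers, which the fragment does not establish): the out-of-order core can overlap independent loads, but a chain of dependent loads exposes the full load-use latency.

    bits 64

            ; Independent loads: the core can issue these in parallel, so the
            ; observed cost per load approaches the load *throughput*.
            mov     rax, [rsi]
            mov     rbx, [rsi+8]
            mov     rcx, [rsi+16]

            ; Dependent loads (pointer chasing): each address comes from the
            ; previous load, so every L1 hit pays the full load-use latency
            ; (roughly 4-5 cycles on recent Intel cores) back to back.
            mov     rax, [rax]
            mov     rax, [rax]
            mov     rax, [rax]

The same logic applies to branches: a correctly predicted branch is nearly free because the front end keeps fetching past it, while a mispredicted one costs a pipeline flush.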


Torbjörn Granlund's "Instruction latencies and throughput for AMD and Intel x86 processors" is good too.

Edit

Granlund's document concerns instruction throughput in the context of how many instructions of a certain type can be issued per clock cycle (i.e. performed in parallel). He also claims that Intel's documentation isn't always accurate.
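
The latency/throughput distinction those tables draw can be made concrete with a small NASM sketch (the cycle counts in the comments are typical of recent Intel cores -- imul r64, r64 at roughly 3-cycle latency and one-per-cycle throughput -- and are illustrative, not authoritative):

    bits 64

            ; Dependent chain: each imul must wait for the previous result,
            ; so this runs at imul's *latency* (~3 cycles per instruction).
            imul    rax, rax
            imul    rax, rax
            imul    rax, rax

            ; Independent multiplies: the core can start one each cycle, so
            ; this runs at imul's *throughput* (~1 cycle per instruction).
            imul    rax, rax
            imul    rbx, rbx
            imul    rcx, rcx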


For what it's worth, there used to be an amazing book called "Inner Loops" by Rick Booth that described in great detail how to manually micro-optimize x86 assembly code for Intel's 80486, Pentium, Pentium Pro, and Pentium MMX processors, with lots of useful real-world code examples (hashing, moving memory, random number generation, Huffman and JPEG compression, matrix multiplication).

Unfortunately, the book hasn't been updated since its first publication in 1997 to cover newer processors and CPU architectures. Nevertheless, I would still recommend it as a gentle introduction to topics such as:

  • which instructions are generally very cheap or merely cheap, and which aren't
  • which registers are the most versatile (i.e. have no special meaning / aren't the default register of some instructions)
  • how to pair instructions so that they are executed in parallel without stalling one pipeline
  • different kinds of stalls
  • branch prediction
  • what to keep in mind with regard to processor caches