Why does Intel's compiler prefer NEG+ADD over SUB?

In examining the output of various compilers for a variety of code snippets, I've noticed that Intel's C compiler (ICC) has a strong tendency to prefer emitting a pair of NEG + ADD instructions where other compilers would use a single SUB instruction.

As a simple example, consider the following C code:

uint64_t Mod3(uint64_t value)
{
    return (value % 3);
}

ICC translates this to the following machine code (regardless of optimization level):

mov       rcx, 0xaaaaaaaaaaaaaaab
mov       rax, rdi
mul       rcx
shr       rdx, 1
lea       rsi, QWORD PTR [rdx+rdx*2]
neg       rsi                            ;   equivalent to:
add       rdi, rsi                       ; /    sub  rdi, rsi
mov       rax, rdi
ret         

Whereas other compilers (including MSVC, GCC, and Clang) will all generate essentially equivalent code, except that the NEG + ADD sequence is replaced by a single SUB instruction.

Like I said, this isn't just a quirk of how ICC compiles this particular snippet. It's a pattern I've observed repeatedly when analyzing the disassembly for arithmetic operations. I normally wouldn't think much of this, except that ICC is known to be a pretty good optimizing compiler and it is developed by folks that have insider information about their microprocessors.

Could there be something that Intel knows about the implementation of the SUB instruction on their processors that makes it more optimal to decompose it into NEG + ADD instructions? Using RISC-style instructions that decode into simpler µops is well-known optimization advice for modern microarchitectures, so is it possible that SUB is broken down internally into individual NEG and ADD µops, and that it is actually more efficient for the front-end decoder to use these "simpler" instructions? Modern CPUs are complicated, so anything is possible.

Agner Fog's comprehensive instruction tables confirm my intuition, though, that this is actually a pessimization. SUB is equally as efficient as ADD on all processors, so the additional required NEG instruction just serves to slow things down.

I also ran the two sequences through Intel's own Architecture Code Analyzer to analyze the throughput. Though the exact cycle counts and port bindings vary from one microarchitecture to another, a single SUB appears to be superior in every respect from Nehalem to Broadwell. Here are the two reports generated by the tool for Haswell:

SUB
Intel(R) Architecture Code Analyzer Version - 2.2 build:356c3b8 (Tue, 13 Dec 2016 16:25:20 +0200)
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 1.85 Cycles       Throughput Bottleneck: Dependency chains (possibly between iterations)

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.0    0.0  | 1.5  | 0.0    0.0  | 0.0    0.0  | 0.0  | 1.8  | 1.7  | 0.0  |
---------------------------------------------------------------------------------------

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    | 0.1       | 0.2 |           |           |     | 0.3 | 0.4 |     | CP | mov rax, 0xaaaaaaaaaaaaaaab
|   2    |           | 1.0 |           |           |     |     | 1.0 |     | CP | mul rcx
|   1    | 0.9       |     |           |           |     |     | 0.1 |     | CP | shr rdx, 0x1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | lea rax, ptr [rdx+rdx*2]
|   1    |           | 0.3 |           |           |     | 0.4 | 0.2 |     | CP | sub rcx, rax
|   1*   |           |     |           |           |     |     |     |     |    | mov rax, rcx
Total Num Of Uops: 7
NEG+ADD
Intel(R) Architecture Code Analyzer Version - 2.2 build:356c3b8 (Tue, 13 Dec 2016 16:25:20 +0200)
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 2.15 Cycles       Throughput Bottleneck: Dependency chains (possibly between iterations)

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.1    0.0  | 2.0  | 0.0    0.0  | 0.0    0.0  | 0.0  | 2.0  | 2.0  | 0.0  |
---------------------------------------------------------------------------------------

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    | 0.1       | 0.9 |           |           |     | 0.1 | 0.1 |     |    | mov rax, 0xaaaaaaaaaaaaaaab
|   2    |           | 1.0 |           |           |     |     | 1.0 |     | CP | mul rcx
|   1    | 1.0       |     |           |           |     |     |     |     | CP | shr rdx, 0x1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | lea rax, ptr [rdx+rdx*2]
|   1    |           | 0.1 |           |           |     | 0.8 | 0.1 |     | CP | neg rax
|   1    | 0.1       |     |           |           |     | 0.1 | 0.9 |     | CP | add rcx, rax
|   1*   |           |     |           |           |     |     |     |     |    | mov rax, rcx
Total Num Of Uops: 8

So, as far as I can tell, NEG + ADD increases the code size, increases the number of µops, increases pressure for execution ports, and increases the number of cycles, thus resulting in a net decrease in throughput compared to SUB . So why is Intel's compiler doing this?

Is this just some quirk of the code generator that should be reported as a defect, or am I missing some merit in my analysis?


Strangely I have a simple answer: Because ICC isn't optimal.

When you write own compiler you get started with some very basic set of operation codes: NOP , MOV , ADD ... up to 10 opcodes. You don't use SUB for a while because it might easily be replaced by: ADD NEGgative operand . NEG isn't basic as well, as it might be replaced by: XOR FFFF...; ADD 1 XOR FFFF...; ADD 1 .

So you implement rather complex bit-based addressing of operand types and sizes. You do it for a single machine code instruction (eg. ADD ) and plan to use it further for most other instructions. But by this time your co-worker finishes implementation of optimal calculation of remainder without use of SUB ! Imagine - it's already called "Optimal_Mod" so you miss some inoptimal thing inside not because you're a bad guy and hate AMD but just because you see - it's already called optimal, optimized.

Intel Compiler is pretty good in general, but it has a long version history, so it can behave strange in some rare cases. I suggest you inform Intel about this issue and look what will happen.

链接地址: http://www.djcxy.com/p/2466.html

上一篇: 如何在我的gcc C ++程序崩溃时自动生成堆栈跟踪

下一篇: 为什么英特尔编译器更喜欢NEG + ADD而不是SUB?