Dependency chain analysis
From Agner Fog's "Optimizing Assembly" guide, Section 12.7: a loop example. One of the paragraphs discussing the example code:
[...] Analysis for Pentium M: ... 13 uops at 3 per clock = one iteration per 4.33c retirement time.
There is a dependency chain in the loop. The latencies are: 2 for memory read, 5 for multiplication, 3 for subtraction, and 3 for memory write, which totals 13 clock cycles. This is three times as much as the retirement time but it is not a loop-carried dependence because the results from each iteration are saved to memory and not reused in the next iteration. The out-of-order execution mechanism and pipelining makes it possible that each calculation can start before the preceding calculation is finished. The only loop-carried dependency chain is add eax,16
which has a latency of only 1.
## Example 12.6b. DAXPY algorithm, 32-bit mode
[...] ; not shown: initialize some regs before the loop
L1:
movapd xmm1, [esi+eax] ; X[i], X[i+1]
mulpd xmm1, xmm2 ; X[i] * DA, X[i+1] * DA
movapd xmm0, [edi+eax] ; Y[i], Y[i+1]
subpd xmm0, xmm1 ; Y[i]-X[i]*DA, Y[i+1]-X[i+1]*DA
movapd [edi+eax], xmm0 ; Store result
add eax, 16 ; Add size of two elements to index
cmp eax, ecx ; Compare with n*8
jl L1 ; Loop back
I cannot understand why the dependency chain doesn't increase a whole throughput. I know that it is only important to find the worst bottleneck. The worst bottleneck identified before considering dependency chains was fused-domain uop throughput, at 4.33 cycles per iteration. I cannot understand why the dependency chain isn't a bigger bottleneck than that.
I see that the author explains that it is connected with out-of-order execution and pipelining but I cannot see it. I mean, though, only multiplication causes latency 5 cycles so only this value is greater than 4 cycle.
I also cannot understand why the author doesn't care about the dependency here: add eax, 16 -> cmp eax, ecx -> jl L1
After all, addition must be executed before cmp
and cmp
must be executed before jl
.
PS: later paragraphs identify the biggest bottleneck for Pentium M as decode, limiting it to one iteration per 6c, because 128b vector ops decode to two uops each. See Agner Fog's guide for the rest of the analysis, and analysis + tuning for Core2, FMA4 Bulldozer, and Sandybridge.
the mul isn't part of a loop-carried dependency chain, so there can be mulpd
insns from multiple iterations in flight at once. The latency of a single instruction isn't the issue here at all, it's the dependency chain. Each iteration has a separate 13c dependency chain of load, mulpd, subpd, store. Out-of-order execution is what allows uops from multiple iterations to be in flight at once.
The cmp
/ jl
in each iteration depend on the add
from that iteration, but the add
in the next iteration doesn't depend on the cmp
. Speculative execution and branch prediction mean that control dependencies (conditional branches and indirect jumps/calls) are not part of data dependency chains. This is why instructions from one iteration can start running before the jl
from the preceding iteration retires.
By comparison, cmov
is a data dependency instead of a control dependency, so branchless loops tend to have loop-carried dependency chains. This tends to be slower than branching if the branch predicts well.
Each loop iteration has a separate cmp
/ jl
dependency chain, just like the FP dependency chain.
I cannot understand why dependency chain doesn't increase a whole throughput.
I have no idea what this sentence means. I think I was able to figure out all your other mixed up words and phrasing. (eg "chain dependency" instead of "dependency chain".) Have a look at my edits to your question; some of them might help your understanding, too.
链接地址: http://www.djcxy.com/p/28786.html上一篇: 改变一个整数的位
下一篇: 依赖链分析