FLOPS per cycle for Sandy Bridge and Haswell SSE2/AVX/AVX2
I'm confused about how many flops per cycle per core can be done with Sandy Bridge and Haswell. As I understand it, with SSE it should be 4 flops per cycle per core, and with AVX/AVX2 it should be 8 flops per cycle per core.
This seems to be verified here: How do I achieve the theoretical maximum of 4 FLOPs per cycle? and here: Sandy-Bridge CPU specification.
However, the link below seems to indicate that Sandy Bridge can do 16 flops per cycle per core and Haswell 32 flops per cycle per core: http://www.extremetech.com/computing/136219-intels-haswell-is-an-unprecedented-threat-to-nvidia-amd.
Can someone explain this to me?
Edit: I understand now why I was confused. I thought the term FLOP referred only to single-precision floating point (SP). I see now that the tests at How do I achieve the theoretical maximum of 4 FLOPs per cycle? are actually in double precision (DP), so they achieve 4 DP FLOPs/cycle for SSE and 8 DP FLOPs/cycle for AVX. It would be interesting to redo these tests in SP.
Here are FLOP counts for a number of recent processor microarchitectures, along with an explanation of how to achieve them:
Intel Core 2 and Nehalem: 4 DP FLOPs/cycle (2-wide SSE2 addition + 2-wide SSE2 multiplication) and 8 SP FLOPs/cycle (4-wide SSE addition + 4-wide SSE multiplication)
Intel Sandy Bridge/Ivy Bridge: 8 DP FLOPs/cycle (4-wide AVX addition + 4-wide AVX multiplication) and 16 SP FLOPs/cycle (8-wide AVX addition + 8-wide AVX multiplication)
Intel Haswell/Broadwell/Skylake/Kaby Lake: 16 DP FLOPs/cycle (two 4-wide FMA instructions) and 32 SP FLOPs/cycle (two 8-wide FMA instructions)
AMD K10: 4 DP FLOPs/cycle (2-wide SSE2 addition + 2-wide SSE2 multiplication) and 8 SP FLOPs/cycle (4-wide SSE addition + 4-wide SSE multiplication)
AMD Bulldozer/Piledriver/Steamroller/Excavator, per module (two cores): 8 DP FLOPs/cycle (4-wide FMA) and 16 SP FLOPs/cycle (8-wide FMA)
AMD Ryzen: 8 DP FLOPs/cycle (4-wide FMA) and 16 SP FLOPs/cycle (8-wide FMA)
Intel Atom (Bonnell/45nm, Saltwell/32nm, Silvermont/22nm): 1.5 DP FLOPs/cycle (scalar SSE2 addition + scalar SSE2 multiplication every other cycle) and 6 SP FLOPs/cycle (4-wide SSE addition + 4-wide SSE multiplication every other cycle)
AMD Bobcat: 1.5 DP FLOPs/cycle and 4 SP FLOPs/cycle
AMD Jaguar: 3 DP FLOPs/cycle and 8 SP FLOPs/cycle
ARM Cortex-A9: 1.5 DP FLOPs/cycle and 4 SP FLOPs/cycle
ARM Cortex-A15: 2 DP FLOPs/cycle (scalar FMA) and 8 SP FLOPs/cycle (4-wide NEONv2 FMA)
Qualcomm Krait: 2 DP FLOPs/cycle (scalar FMA) and 8 SP FLOPs/cycle (4-wide NEONv2 FMA)
IBM PowerPC A2 (Blue Gene/Q), per core: 8 DP FLOPs/cycle (4-wide QPX FMA every cycle); SP elements are extended to DP and processed on the same units
IBM PowerPC A2 (Blue Gene/Q), per thread: 4 DP FLOPs/cycle (4-wide QPX FMA every other cycle)
Intel Xeon Phi (Knights Corner), per core: 16 DP FLOPs/cycle (8-wide FMA every cycle) and 32 SP FLOPs/cycle (16-wide FMA every cycle)
Intel Xeon Phi (Knights Corner), per thread: 8 DP FLOPs/cycle (8-wide FMA every other cycle) and 16 SP FLOPs/cycle (16-wide FMA every other cycle)
Intel Xeon Phi (Knights Landing), per core: 32 DP FLOPs/cycle (two 8-wide FMA every cycle) and 64 SP FLOPs/cycle (two 16-wide FMA every cycle)
The reason there are both per-thread and per-core figures for IBM Blue Gene/Q and Intel Xeon Phi (Knights Corner) is that these cores have a higher instruction issue rate when running more than one thread per core.
The throughput on Haswell is lower for addition than for multiplication and FMA: there are two multiplication/FMA units, but only one FP add unit. If your code contains mainly additions, you have to replace the additions with FMA instructions using a multiplier of 1.0 to get the maximum throughput.
The latency of FMA instructions on Haswell is 5 cycles and the throughput is 2 per clock. This means that you must keep 10 parallel operations in flight to get the maximum throughput. If, for example, you want to add a very long list of FP numbers, you would have to split it into ten parts and use ten accumulator registers.
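The ten-accumulator idea can be sketched in scalar C (a real kernel would keep ten AVX registers as accumulators, but the dependency-breaking structure is the same):

```c
#include <stddef.h>

/* Sum a long array with 10 independent accumulators, so 10 add chains
   are in flight at once -- enough to cover a 5-cycle latency at a
   throughput of 2 operations per cycle (5 * 2 = 10). */
double sum10(const double *x, size_t n) {
    double acc[10] = {0};
    size_t i = 0;
    for (; i + 10 <= n; i += 10)
        for (int k = 0; k < 10; k++)
            acc[k] += x[i + k];   /* 10 independent dependency chains */
    for (; i < n; i++)
        acc[0] += x[i];           /* leftover elements */
    double s = 0.0;
    for (int k = 0; k < 10; k++)  /* final reduction of the accumulators */
        s += acc[k];
    return s;
}
```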
This is indeed possible, but who would make such a weird optimization for one specific processor?