FLOPS per cycle for Sandy Bridge and Haswell SSE2/AVX/AVX2
I'm confused about how many flops per cycle per core can be done with Sandy Bridge and Haswell. As I understand it, with SSE it should be 4 flops per cycle per core, and with AVX/AVX2 it should be 8 flops per cycle per core.
This seems to be verified here: How do I achieve the theoretical maximum of 4 FLOPs per cycle? and here: Sandy-Bridge CPU specification.
However, the link below seems to indicate that Sandy Bridge can do 16 flops per cycle per core and Haswell 32 flops per cycle per core: http://www.extremetech.com/computing/136219-intels-haswell-is-an-unprecedented-threat-to-nvidia-amd.
Can someone explain this to me?
Edit: I understand now why I was confused. I thought the term FLOP referred only to single-precision floating point (SP). I see now that the tests at How do I achieve the theoretical maximum of 4 FLOPs per cycle? are actually in double precision (DP), so they achieve 4 DP FLOPs/cycle for SSE and 8 DP FLOPs/cycle for AVX. It would be interesting to redo these tests in SP.
Here are FLOP counts for a number of recent processor microarchitectures, along with an explanation of how to achieve them:
Intel Core 2 and Nehalem: 4 DP FLOPs/cycle (2-wide SSE2 addition + 2-wide SSE2 multiplication) and 8 SP FLOPs/cycle (4-wide SSE addition + 4-wide SSE multiplication)
Intel Sandy Bridge/Ivy Bridge: 8 DP FLOPs/cycle (4-wide AVX addition + 4-wide AVX multiplication) and 16 SP FLOPs/cycle (8-wide AVX addition + 8-wide AVX multiplication)
Intel Haswell/Broadwell/Skylake/Kaby Lake: 16 DP FLOPs/cycle (two 4-wide FMA instructions) and 32 SP FLOPs/cycle (two 8-wide FMA instructions)
AMD K10: 4 DP FLOPs/cycle (2-wide SSE2 addition + 2-wide SSE2 multiplication) and 8 SP FLOPs/cycle (4-wide SSE addition + 4-wide SSE multiplication)
AMD Bulldozer/Piledriver/Steamroller/Excavator, per module (two cores): 8 DP FLOPs/cycle (4-wide FMA) and 16 SP FLOPs/cycle (8-wide FMA)
AMD Ryzen: 8 DP FLOPs/cycle (4-wide FMA) and 16 SP FLOPs/cycle (8-wide FMA)
Intel Atom (Bonnell/45nm, Saltwell/32nm, Silvermont/22nm): 1.5 DP FLOPs/cycle (scalar SSE2 addition + scalar SSE2 multiplication every other cycle) and 6 SP FLOPs/cycle (4-wide SSE addition + 4-wide SSE multiplication every other cycle)
AMD Bobcat: 1.5 DP FLOPs/cycle and 4 SP FLOPs/cycle
AMD Jaguar: 3 DP FLOPs/cycle and 8 SP FLOPs/cycle
ARM Cortex-A9: 1.5 DP FLOPs/cycle and 4 SP FLOPs/cycle
ARM Cortex-A15: 2 DP FLOPs/cycle (scalar FMA) and 8 SP FLOPs/cycle (4-wide NEONv2 FMA)
Qualcomm Krait: 2 DP FLOPs/cycle (scalar FMA) and 8 SP FLOPs/cycle (4-wide NEONv2 FMA)
IBM PowerPC A2 (Blue Gene/Q), per core: 8 DP FLOPs/cycle (4-wide QPX FMA every cycle); SP elements are extended to DP and processed on the same units
IBM PowerPC A2 (Blue Gene/Q), per thread: 4 DP FLOPs/cycle (4-wide QPX FMA every other cycle)
Intel Xeon Phi (Knights Corner), per core: 16 DP FLOPs/cycle (8-wide FMA every cycle) and 32 SP FLOPs/cycle (16-wide FMA every cycle)
Intel Xeon Phi (Knights Corner), per thread: 8 DP FLOPs/cycle (8-wide FMA every other cycle) and 16 SP FLOPs/cycle (16-wide FMA every other cycle)
Intel Xeon Phi (Knights Landing), per core: 32 DP FLOPs/cycle (two 8-wide FMA every cycle) and 64 SP FLOPs/cycle (two 16-wide FMA every cycle)
The reason there are both per-thread and per-core figures for IBM Blue Gene/Q and Intel Xeon Phi (Knights Corner) is that these cores have a higher instruction issue rate when running more than one thread per core.
The throughput on Haswell is lower for addition than for multiplication and FMA: there are two multiplication/FMA units, but only one FP add unit. If your code contains mainly additions, you have to replace the additions with FMA instructions using a multiplier of 1.0 to get the maximum throughput.
The latency of FMA instructions on Haswell is 5 cycles and the throughput is 2 per clock. This means that you must keep 10 parallel operations in flight to get the maximum throughput. If, for example, you want to add a very long list of FP numbers, you would have to split it into ten parts and use ten accumulator registers.
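The ten-accumulator idea can be sketched in scalar C (a real kernel would keep ten AVX registers as accumulators, but the dependency-breaking structure is the same):

```c
#include <stddef.h>

/* Sum a long array with 10 independent accumulators, so 10 add chains
   are in flight at once -- enough to cover a 5-cycle latency at a
   throughput of 2 operations per cycle (5 * 2 = 10). */
double sum10(const double *x, size_t n) {
    double acc[10] = {0};
    size_t i = 0;
    for (; i + 10 <= n; i += 10)
        for (int k = 0; k < 10; k++)
            acc[k] += x[i + k];   /* 10 independent dependency chains */
    for (; i < n; i++)
        acc[0] += x[i];           /* leftover elements */
    double s = 0.0;
    for (int k = 0; k < 10; k++)  /* final reduction of the accumulators */
        s += acc[k];
    return s;
}
```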
This is indeed possible, but who would make such a weird optimization for one specific processor?