Understanding FMA instructions performance

I'm trying to understand how I can maximize the number of operations per cycle on my CPU. I'm writing a simple matrix multiplication program, and I have a Skylake processor. I was looking at the Wikipedia page listing FLOPs for this architecture, and I'm having difficulty understanding it.

From my understanding, FMA instructions take three floating-point inputs, right? And they let you mix adds and multiplies between them. But what happens when I only add two floats? Does it simply multiply one of them by one? Can I add three floats in one cycle, or will that be split? I saw that Skylake has 32 FLOPs/cycle for single-precision inputs, but what is the meaning of "two 8-wide FMA instructions"?

Thank you in advance for the explanations


FMA calculates ± a*b ± c in a single operation, with a single rounding error. That's what it does, nothing else. Calculating a + b + c cannot be done using an FMA instruction; you need two dependent ADD operations for that.
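To make the "single rounding" point concrete, here is a minimal sketch using the standard std::fma from <cmath>. The function names are made up for illustration, and the volatile is only there to stop the compiler from fusing the multiply and add on its own.

```cpp
#include <cmath>

// Fused: a*b + c is computed exactly and rounded once at the end.
double fused_mul_add(double a, double b, double c) {
    return std::fma(a, b, c);
}

// Unfused: the product is rounded to double first, then the sum is rounded
// again. volatile forces the intermediate rounding and prevents the
// compiler from contracting the two operations into one FMA.
double separate_mul_add(double a, double b, double c) {
    volatile double product = a * b;
    return product + c;
}

// a + b + c has no multiply to fuse: it takes two dependent additions.
double add3(double a, double b, double c) {
    double partial = a + b;  // first ADD
    return partial + c;      // second ADD, must wait for the first
}
```

With a = 1 + 2^-27, b = 1 - 2^-27 and c = -1, the exact product is 1 - 2^-54, which rounds to 1.0 as a double. So separate_mul_add returns 0, while fused_mul_add returns -2^-54.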

Depending on the compiler, you may have to turn on a compiler option to allow the use of FMA instructions, because they don't give results identical to a multiply followed by an add. And you may have to rearrange your code in some cases: for example, a*b + c*d + e will be calculated as x = a*b; y = FMA (c, d, x); z = y + e, but e + a*b + c*d will be calculated as x = FMA (a, b, e); z = FMA (c, d, x). The basic operation of an FFT can be performed with ten floating-point operations and can be rewritten as six operations using four FMAs and two other operations.
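The two evaluation orders can be written out by hand with std::fma, which shows why the second form needs one operation fewer. This is only a sketch with hypothetical function names. With GCC and Clang, the options involved are typically -mfma (enable the x86 FMA instructions) and -ffp-contract (allow or forbid fusing), but check your compiler's documentation.

```cpp
#include <cmath>

// a*b + c*d + e, evaluated left to right:
// one multiply, one FMA, one add (three operations).
double products_then_e(double a, double b, double c, double d, double e) {
    double x = a * b;
    double y = std::fma(c, d, x);  // c*d + x
    return y + e;
}

// e + a*b + c*d, evaluated left to right:
// two FMAs (two operations).
double e_then_products(double a, double b, double c, double d, double e) {
    double x = std::fma(a, b, e);  // a*b + e
    return std::fma(c, d, x);      // c*d + x
}
```

Both functions compute the same mathematical value, but the rounding steps differ, so the results can differ in the last bit for some inputs.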

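"Two 8-wide FMA instructions" means that one FMA instruction operates on 256-bit vector registers containing 8 floats each, and that the core can execute two such instructions in the same cycle. Each FMA counts as two floating-point operations per lane (a multiply and an add), so 8 lanes x 2 operations x 2 instructions = 32 FLOPs/cycle.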
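As a rough illustration, here is a portable scalar model of what one 8-wide FMA does, plus the arithmetic behind the peak figure. It is a sketch, not real vector code: Vec8, splat, fma8 and peak_flops_per_cycle are hypothetical names, and actual AVX2 code would use intrinsics or rely on the compiler's auto-vectorizer.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

constexpr std::size_t kLanes = 8;  // 256-bit register / 32-bit float

using Vec8 = std::array<float, kLanes>;

// Helper: fill all 8 lanes with the same value.
Vec8 splat(float v) {
    Vec8 r{};
    r.fill(v);
    return r;
}

// Scalar model of one 8-wide FMA: every lane computes a[i]*b[i] + c[i].
// Real hardware does all 8 lanes in a single instruction.
Vec8 fma8(const Vec8& a, const Vec8& b, const Vec8& c) {
    Vec8 r{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        r[i] = std::fma(a[i], b[i], c[i]);
    }
    return r;
}

// Peak FLOPs per cycle: lanes per register, times 2 (multiply + add),
// times the number of FMA instructions issued per cycle.
constexpr int peak_flops_per_cycle(int vector_bits, int element_bits,
                                   int fma_per_cycle) {
    return (vector_bits / element_bits) * 2 * fma_per_cycle;
}
```

Here peak_flops_per_cycle(256, 32, 2) gives 32 for single precision, and the same formula with 64-bit elements gives 16 for double precision. These are peak numbers: a matrix multiplication only approaches them if it keeps both FMA units busy with independent operations every cycle.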
