Optimized 2x2 matrix multiplication: Slow assembly versus fast SIMD
Problem
I am studying high performance matrix multiplication algorithms such as OpenBLAS or GotoBLAS and I am trying to reproduce some of the results. This question deals with the inner kernel of a matrix multiplication algorithm. Specifically, I am looking at computing C += AB, where A and B are 2x2 matrices of type double, at the peak speed of my CPU. There are two ways to do this. One way is to use SIMD intrinsics. The second way is to code directly in assembly using the SIMD registers.
What I have looked at so far
I have read all the relevant papers and course webpages, and many SO Q&As dealing with the subject (too many to list). I have compiled OpenBLAS on my computer, looked through the OpenBLAS, GotoBLAS, and BLIS source codes, and read Agner Fog's optimization manuals.
Hardware
My CPU is an Intel i5-540M. You can find the relevant CPUID information on cpu-world.com. The microarchitecture is Nehalem (Westmere), so it can theoretically compute 4 double-precision flops per core per cycle (one 2-wide SSE add and one 2-wide SSE multiply each cycle). I will be using just one core (no OpenMP), so with hyperthreading off and 4-step Intel Turbo Boost, I should be seeing a peak of (2.533 GHz + 4*0.133 GHz) * (4 DP flops/core/cycle) * (1 core) = 12.27 DP GFLOPS. For reference, with both cores running at peak, Intel Turbo Boost gives a 2-step speed up and I should get a theoretical peak of 22.4 DP GFLOPS.
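As a sanity check, the arithmetic behind those two peak figures can be evaluated directly; this tiny sketch just restates the numbers quoted above:
#include <stdio.h>

int main(void){
    const double base_ghz  = 2.533; // i5-540M base clock
    const double step_ghz  = 0.133; // one Turbo Boost step
    const double flops_cyc = 4.0;   // 2 DP adds + 2 DP muls per cycle (SSE)
    double peak1 = (base_ghz + 4*step_ghz)*flops_cyc;     // 1 core, 4 Turbo steps
    double peak2 = (base_ghz + 2*step_ghz)*flops_cyc*2.0; // 2 cores, 2 Turbo steps
    printf("1-core peak: %.2f DP GFLOPS\n", peak1);       // ~12.26
    printf("2-core peak: %.2f DP GFLOPS\n", peak2);       // ~22.39
    return 0;
}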
Setup
I declare my 2x2 matrices as double arrays and initialize them with random entries, as shown in the code snippet below.
srand(time(NULL));
const int n = 2;
double A[n*n];
double B[n*n];
double C[n*n];
double T[n*n];
for(int i = 0; i < n*n; i++){
    A[i] = (double) rand()/RAND_MAX;
    B[i] = (double) rand()/RAND_MAX;
    C[i] = 0.0;
    T[i] = 0.0; // T accumulates the reference result below, so zero it too
}
I compute a true answer using naive matrix-matrix multiplication (shown below), which allows me to check my result either visually or by computing the L2 norm over all the elements. Note that T must start zeroed, since the triple loop accumulates into it; this is handled in the initialization loop above.
// "true" answer
for(int i = 0; i < n; i++)
    for(int j = 0; j < n; j++)
        for(int k = 0; k < n; k++)
            T[i*n + j] += A[i*n + k]*B[k*n + j];
To run the code and get an estimate of the Gflops, I call each multiplication function once to warm up, and then execute it inside a for loop maxiter times, making sure to zero the C matrix each time since I am computing C += AB. The for loop is placed between two clock() calls, and this is used to estimate the Gflops. The code snippet below illustrates this part.
C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
mult2by2(A,B,C); //warmup
time1 = clock();
for(int i = 0; i < maxiter; i++){
    mult2by2(A,B,C);
    C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
}
time2 = clock() - time1;
time3 = (double)(time2)/CLOCKS_PER_SEC;
gflops = (double) (2.0*n*n*n)/time3/1.0e9*maxiter;
mult2by2(A,B,C); // to compute the norm against T
norm = L2norm(n,C,T);
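The L2norm helper is not defined in the question; a minimal sketch consistent with its call site (the signature is inferred, not taken from the original code) would be:
#include <math.h>

// L2 (Frobenius-style) norm of the elementwise difference C - T,
// for n-by-n matrices stored in row-major order.
double L2norm(int n, const double* C, const double* T) {
    double sum = 0.0;
    for (int i = 0; i < n*n; i++) {
        double d = C[i] - T[i];
        sum += d*d;
    }
    return sqrt(sum);
}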
SIMD code
My CPU supports 128-bit vectors, so I can fit 2 double
s in each vector. This is the main reason why I am doing 2x2 matrix multiplication in the inner kernel. The SIMD code computes an entire row of C
at one time.
inline void
__attribute__ ((gnu_inline))
__attribute__ ((aligned(16))) mult2by2B(
    const double* restrict A,
    const double* restrict B,
    double* restrict C
    )
{
    register __m128d xmm0, xmm1, xmm2, xmm3, xmm4;
    // row 0 of C: C[0:1] += A[0]*B[0:1] + A[1]*B[2:3]
    xmm0 = _mm_load_pd(C);       // C row 0
    xmm1 = _mm_load1_pd(A);      // broadcast A[0]
    xmm2 = _mm_load_pd(B);       // B row 0
    xmm3 = _mm_load1_pd(A + 1);  // broadcast A[1]
    xmm4 = _mm_load_pd(B + 2);   // B row 1
    xmm1 = _mm_mul_pd(xmm1,xmm2);
    xmm2 = _mm_add_pd(xmm1,xmm0);
    xmm1 = _mm_mul_pd(xmm3,xmm4);
    xmm2 = _mm_add_pd(xmm1,xmm2);
    _mm_store_pd(C,xmm2);
    // row 1 of C: C[2:3] += A[2]*B[0:1] + A[3]*B[2:3]
    xmm0 = _mm_load_pd(C + 2);   // C row 1
    xmm1 = _mm_load1_pd(A + 2);  // broadcast A[2]
    xmm2 = _mm_load_pd(B);       // B row 0
    xmm3 = _mm_load1_pd(A + 3);  // broadcast A[3]
    //xmm4 = _mm_load_pd(B + 2); // B row 1 is still live in xmm4
    xmm1 = _mm_mul_pd(xmm1,xmm2);
    xmm2 = _mm_add_pd(xmm1,xmm0);
    xmm1 = _mm_mul_pd(xmm3,xmm4);
    xmm2 = _mm_add_pd(xmm1,xmm2);
    _mm_store_pd(C + 2,xmm2);
}
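For reference, the scalar arithmetic that mult2by2B performs (assuming row-major 2x2 storage, as the loads above imply) is just:
// Scalar equivalent of mult2by2B: C += A*B for row-major 2x2 matrices.
void mult2by2_scalar(const double* A, const double* B, double* C) {
    C[0] += A[0]*B[0] + A[1]*B[2];
    C[1] += A[0]*B[1] + A[1]*B[3];
    C[2] += A[2]*B[0] + A[3]*B[2];
    C[3] += A[2]*B[1] + A[3]*B[3];
}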
Assembly (Intel Syntax)
My first attempt was to create a separate assembly routine for this part and call it from the main routine. However, it was extremely slow because extern functions cannot be inlined. I therefore wrote the assembly as inline assembly, as shown below. It is identical to the assembly produced by gcc -S -std=c99 -O3 -msse3 -ffast-math -march=nocona -mtune=nocona -funroll-all-loops -fomit-frame-pointer -masm=intel. From what I understand of the Nehalem microarchitecture diagrams, this processor can perform SSE ADD, SSE MUL, and SSE MOV in parallel, which explains the interleaving of MUL, ADD, and MOV instructions. You will notice the SIMD instructions above are in a different order because I had a different understanding from Agner Fog's manuals. Nevertheless, gcc is smart, and the SIMD code above compiles to the assembly shown in the inline version.
inline void
__attribute__ ((gnu_inline))
__attribute__ ((aligned(16))) mult2by2A
    (
    const double* restrict A,
    const double* restrict B,
    double* restrict C
    )
{
    __asm__ __volatile__
    (
    "mov edx, %[A]                      \n\t"
    "mov ecx, %[B]                      \n\t"
    "mov eax, %[C]                      \n\t"
    "movapd xmm3, XMMWORD PTR [ecx]     \n\t"
    "movapd xmm2, XMMWORD PTR [ecx+16]  \n\t"
    "movddup xmm1, QWORD PTR [edx]      \n\t"
    "mulpd xmm1, xmm3                   \n\t"
    "addpd xmm1, XMMWORD PTR [eax]      \n\t"
    "movddup xmm0, QWORD PTR [edx+8]    \n\t"
    "mulpd xmm0, xmm2                   \n\t"
    "addpd xmm0, xmm1                   \n\t"
    "movapd XMMWORD PTR [eax], xmm0     \n\t"
    "movddup xmm4, QWORD PTR [edx+16]   \n\t"
    "mulpd xmm4, xmm3                   \n\t"
    "addpd xmm4, XMMWORD PTR [eax+16]   \n\t"
    "movddup xmm5, QWORD PTR [edx+24]   \n\t"
    "mulpd xmm5, xmm2                   \n\t"
    "addpd xmm5, xmm4                   \n\t"
    "movapd XMMWORD PTR [eax+16], xmm5  \n\t"
    : // no outputs
    : // inputs
    [A] "m" (A),
    [B] "m" (B),
    [C] "m" (C)
    : // register clobber
    "memory",
    "edx","ecx","eax",
    "xmm0","xmm1","xmm2","xmm3","xmm4","xmm5"
    );
}
Results
I compile my code with the following flags:
gcc -std=c99 -O3 -msse3 -ffast-math -march=nocona -mtune=nocona -funroll-all-loops -fomit-frame-pointer -masm=intel
The results for maxiter = 1000000000 are below:
********** Inline ASM
L2 norm: 0.000000e+000, Avg. CPU time: 9.563000, Avg. Gflops: 1.673115
********** SIMD Version
L2 norm: 0.000000e+000, Avg. CPU time: 0.359000, Avg. Gflops: 44.568245
If I force the SIMD version to not be inlined with __attribute__ ((noinline)), the results are:
********** Inline ASM
L2 norm: 0.000000e+000, Avg. CPU time: 11.155000, Avg. Gflops: 1.434334
********** SIMD Version
L2 norm: 0.000000e+000, Avg. CPU time: 11.264000, Avg. Gflops: 1.420455
Questions
If both the inline ASM and SIMD implementations produce identical assembly output, why is the assembly version so much slower? It's as if the inline assembly did not get inlined, which is made evident by the second set of results showing identical performance for "inline" ASM versus "noinline" SIMD. The only explanation I can find is in Agner Fog's Volume 2, page 6:
Compiled code may be faster than assembly code because compilers can make inter-procedural optimization and whole-program optimization. The assembly programmer usually has to make well-defined functions with a well-defined call interface that obeys all calling conventions in order to make the code testable and verifiable. This prevents many of the optimization methods that compilers use, such as function inlining, register allocation, constant propagation, common subexpression elimination across functions, scheduling across functions, etc. These advantages can be obtained by using C++ code with intrinsic functions instead of assembly code.
But the assembler output for both versions is exactly the same.
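One way to verify what actually got inlined is to inspect the assembler output yourself; for example (the file name here is a placeholder, not from the original post):
gcc -S -std=c99 -O3 -msse3 -ffast-math -march=nocona -mtune=nocona -masm=intel matmul.c
grep -n "mult2by2" matmul.s
If a function was truly inlined, no surviving call to it appears in the .s file.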
Why am I seeing 44 Gflops in the first set of results? This is way above the 12 Gflops peak I calculated, and is what I would expect if I were running both cores with single-precision calculations.
EDIT 1 A comment suggested there might be dead code elimination. I can confirm that this is happening for the SIMD version: the -S output shows that the for loop for the SIMD version only zeros the C matrix. I can disable this by turning off compiler optimization with -O0. In that case, SIMD runs 3x as slow as ASM, but ASM still runs at exactly the same speed. The norm is also nonzero now, but it is still OK at 10^-16. I also see that the inline ASM version is being inlined (between the APP and NO_APP tags), but it is also being unrolled 8 times inside the for loop. I think unrolling that many times will impact performance heavily, as I usually unroll loops 4 times; anything more, in my experience, seems to degrade performance.
Answer
GCC is optimizing away your inline function using intrinsics, mult2by2B, due to the line
C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
Without that line it takes 2.9 seconds on the computer from Coliru: http://coliru.stacked-crooked.com/a/992304f5f672e257
With the line it takes only 0.000001 seconds: http://coliru.stacked-crooked.com/a/9722c39bb6b8590a
You can also see this in the assembly. If you drop the code below into http://gcc.godbolt.org/ you will see that with that line of code it skips the function entirely.
However, when you inline the assembly, GCC is NOT optimizing the function mult2by2A away (even though it does inline it). You can see this in the assembly as well.
#include <stdio.h>
#include <emmintrin.h> // SSE2
#include <omp.h>
inline void
__attribute__ ((gnu_inline))
__attribute__ ((aligned(16))) mult2by2B(
    const double* __restrict A,
    const double* __restrict B,
    double* __restrict C
    )
{
    register __m128d xmm0, xmm1, xmm2, xmm3, xmm4;
    xmm0 = _mm_load_pd(C);
    xmm1 = _mm_load1_pd(A);
    xmm2 = _mm_load_pd(B);
    xmm3 = _mm_load1_pd(A + 1);
    xmm4 = _mm_load_pd(B + 2);
    xmm1 = _mm_mul_pd(xmm1,xmm2);
    xmm2 = _mm_add_pd(xmm1,xmm0);
    xmm1 = _mm_mul_pd(xmm3,xmm4);
    xmm2 = _mm_add_pd(xmm1,xmm2);
    _mm_store_pd(C,xmm2);
    xmm0 = _mm_load_pd(C + 2);
    xmm1 = _mm_load1_pd(A + 2);
    xmm2 = _mm_load_pd(B);
    xmm3 = _mm_load1_pd(A + 3);
    //xmm4 = _mm_load_pd(B + 2);
    xmm1 = _mm_mul_pd(xmm1,xmm2);
    xmm2 = _mm_add_pd(xmm1,xmm0);
    xmm1 = _mm_mul_pd(xmm3,xmm4);
    xmm2 = _mm_add_pd(xmm1,xmm2);
    _mm_store_pd(C + 2,xmm2);
}
int main() {
    double A[4] = {1.0, 2.0, 3.0, 4.0}; // arbitrary test values
    double B[4] = {5.0, 6.0, 7.0, 8.0};
    double C[4] = {0.0, 0.0, 0.0, 0.0};
    int maxiter = 10000000;
    //int maxiter = 1000000000;
    double dtime;
    dtime = omp_get_wtime(); // requires -fopenmp
    for(int i = 0; i < maxiter; i++){
        mult2by2B(A,B,C);
        C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
    }
    dtime = omp_get_wtime() - dtime;
    printf("%f %f %f %f\n", C[0], C[1], C[2], C[3]);
    //gflops = (double) (2.0*n*n*n)/time3/1.0e9*maxiter;
    printf("time %f\n", dtime);
}
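As an aside, a common way to benchmark a kernel like this without letting the compiler delete it is to force the result to be observed, for example with an empty inline-asm "escape" (the helper name here is my own, not from the original post):
// Tell the compiler the memory reachable through p may be read or
// written in ways it cannot see, so stores to it cannot be elided.
static inline void escape(void* p) {
    __asm__ __volatile__("" : : "g"(p) : "memory");
}

// Usage inside the timing loop:
//   mult2by2B(A,B,C);
//   escape(C); // C is now "used"; the call cannot be optimized away
//   C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;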