OpenMP Performance Issues with Matrix Multiplication

I am having performance issues with OpenMP. I am comparing a single-threaded program that does not use OpenMP against a version that does. Results I have seen online for matrix chain multiplication programs show the OpenMP implementation running 2 to 3 times faster, but my two versions run at the same speed. Am I using OpenMP incorrectly? Any pointers on how to implement it correctly would be much appreciated. Thanks in advance.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main( int argc , char *argv[] ) 
{
   srand(time(0));
   if ( argc != 2 )
   {
      printf("Usage: %s <size of nxn matrices>\n", argv[0]);
      return 1; 
   }

   int n = atoi( argv[1] );
   int a, b;
   double A[n][n], B[n][n], C[n][n];
   FILE *fp;
   fp = fopen("/home/mkj0002/CPE631/Homework2/ArrayTry/matrixResults", "w+"); //For the LeCASA machine

   for(a = 0; a < n; a++)
   {
       for(b = 0; b < n; b++)
       {
          A[a][b] = ((double)rand()/(double)RAND_MAX);  //Number between 0 and 1
          A[a][b] = (double)rand();         //Number between 0 and RAND_MAX
          B[a][b] = ((double)rand()/(double)RAND_MAX);  //Number between 0 and 1
          B[a][b] = (double)rand();         //Number between 0 and RAND_MAX
          C[a][b] = 0.0;
       }
    }

    #pragma omp parallel shared(A,B,C)
    {
        int i,j,k;
        #pragma omp for schedule(guided,n)
        for(i = 0; i < n; ++i)
        {
            for(j = 0; j < n; ++j)
            {
                double sum = 0;
                for(k = 0; k < n; ++k)
                {
                    sum += A[i][k] * B[k][j];
                }

                C[i][j] = sum;
                fprintf(fp,"%0.4lf",C[i][j]);
            }
        }
    }

    if(fp)
    {
        fclose(fp);
    }
    fp = NULL;

    return 0;
}                  

(1) Don't perform I/O inside your parallel region. Every thread calls fprintf on the same FILE, and stdio implementations typically lock the stream on each call, so the threads end up waiting on each other. Move the output out of the parallel region and write the whole C matrix to the file afterwards; you should see a speedup right away.

(2) Once you've done that, change your scheduling to static. Every iteration of the outer loop does exactly the same amount of computation, so there is no longer any reason to pay the overhead of fancier scheduling.

(3) Furthermore, to make better use of the cache, swap your j and k loops. To see why, consider how your current loops access just the B variable:

for(j = 0; j < n; ++j)
{
    for(k = 0; k < n; ++k)
    {
        B[k][j] += 5.0;
    }
}

This accesses B as if it were stored in Fortran's column-major format: C stores arrays row by row, so consecutive iterations of the inner loop touch elements that are a full row apart in memory. A better alternative is:

for(k = 0; k < n; ++k)
{
    for(j = 0; j < n; ++j)
    {
        B[k][j] += 5.0;
    }
}
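To make the difference in stride concrete, here is a minimal sketch that measures how far apart consecutive inner-loop accesses land in memory. The size `N` and the helper names `step_inner_j` and `step_inner_k` are made up for this illustration and are not part of your program.

```c
#include <stddef.h>

enum { N = 4 };  /* illustrative matrix size, chosen arbitrarily */

/* Distance, in doubles, between B[k][j] and B[k][j+1]:
   the step taken when j is the inner loop. */
ptrdiff_t step_inner_j(double B[N][N])
{
    const char *p = (const char *)&B[0][0];
    const char *q = (const char *)&B[0][1];
    return (q - p) / (ptrdiff_t)sizeof(double);
}

/* Distance, in doubles, between B[k][j] and B[k+1][j]:
   the step taken when k is the inner loop. */
ptrdiff_t step_inner_k(double B[N][N])
{
    const char *p = (const char *)&B[0][0];
    const char *q = (const char *)&B[1][0];
    return (q - p) / (ptrdiff_t)sizeof(double);
}
```

With j innermost the step is a single double, so neighbouring accesses usually share a cache line. With k innermost the step is a whole row of N doubles, which for large matrices tends to mean a different cache line on every access.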

Coming back to your example though, we still have to deal with the sum variable. A simple approach is to keep a row of partial sums, one for each j, while you compute the current row i, and then store them all into C once the k loop for that row is done.

Combining all 3 steps, we get something like:

#pragma omp parallel shared(A,B,C)
{
    int i,j,k;
    double sum[n]; // one for each j

    #pragma omp for schedule(static)
    for(i = 0; i < n; ++i)
    {
        for(j = 0; j < n; ++j)
            sum[j] = 0;

        for(k = 0; k < n; ++k)
        {
            for(j = 0; j < n; ++j)
            {
                sum[j] += A[i][k] * B[k][j];
            }
        }

        for(j = 0; j < n; ++j)
            C[i][j] = sum[j];
    }
}

// perform I/O here using contiguous blocks of C variable
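For the I/O step after the parallel region, something along these lines could work. It is only a sketch: `write_matrix` is a hypothetical helper, it assumes C is stored as one flat row-major array (your `double C[n][n]` could be passed as `&C[0][0]`), and the one-row-per-line output layout is my own choice rather than anything your program requires.

```c
#include <stdio.h>

/* Hypothetical helper: write an n x n row-major matrix to fp,
   one row per line, values separated by spaces.
   Returns 0 on success, -1 if fp is NULL or a write fails. */
int write_matrix(FILE *fp, int n, const double *C)
{
    if (fp == NULL)
        return -1;

    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            /* A space between values, a newline after the last one in a row. */
            char sep = (j + 1 < n) ? ' ' : '\n';
            if (fprintf(fp, "%0.4f%c", C[(size_t)i * n + j], sep) < 0)
                return -1;
        }
    }
    return 0;
}
```

Because this runs on a single thread after the multiplication has finished, the threads never contend for the file, and the NULL check also covers the case where fopen failed.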

Hope that helps.

EDIT: As per @Zboson's suggestion, it would be even easier to remove sum[j] entirely and use C[i][j] in its place throughout the program.
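A rough sketch of that variant, accumulating straight into C, might look like the following. The function name `matmul_ikj` and the flat row-major array layout are assumptions made for the example; the loop order and the static schedule are the ones discussed above. C must not overlap A or B.

```c
#include <stddef.h>

/* Hypothetical helper: C = A * B for n x n row-major matrices stored in
   flat arrays. Build with OpenMP enabled (for example -fopenmp with GCC);
   without it the pragma is ignored and the loop simply runs serially. */
void matmul_ikj(int n, const double *A, const double *B, double *C)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        double *Ci = C + (size_t)i * n;      /* row i of C, owned by one thread */

        for (int j = 0; j < n; ++j)
            Ci[j] = 0.0;

        for (int k = 0; k < n; ++k) {
            const double aik = A[(size_t)i * n + k];
            const double *Bk = B + (size_t)k * n;   /* row k of B */

            /* Inner loop walks row k of B and row i of C contiguously. */
            for (int j = 0; j < n; ++j)
                Ci[j] += aik * Bk[j];
        }
    }
}
```

Each iteration of the outer loop writes only its own row of C, so no synchronization between threads is needed.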
