Optimization for Matrix Multiply (OpenMP)
I am learning a little bit about OpenMP and trying to use it here to multiply two matrices together.
void matrix_multiply(matrix *A, matrix *B, matrix *C) {
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < A->dim.rows; i++) {
            for (int j = 0; j < B->dim.cols; j++) {
                C->data[i][j] = 0;
                for (int k = 0; k < A->dim.cols; k++) {
                    C->data[i][j] += A->data[i][k] * B->data[k][j];
                }
            }
        }
    }
}
typedef struct shape {
    int rows;
    int cols;
} shape;

typedef struct matrix {
    shape dim;
    float **data;
} matrix;
I'm still a little new to this, so: are there any simple changes that would improve performance, or have I already covered them? Also, am I running into any data races by not using a reduction?
Your current implementation probably cannot be improved much; at this point it comes down to the compiler and cache usage. An interesting point made here by Intel is that GCC requires two of the loops to be swapped in order to vectorize the multiplications (i.e., to use SIMD). For very large matrices, you might also consider dividing the matrices into blocks rather than stripes; this introduces complexity and overhead, but can improve cache usage.
The reduction clause is only needed when multiple threads sum into a single shared variable, which is not the case here: each `C->data[i][j]` is accumulated over `k` by exactly one thread, so there is no data race.
Finally (though this is entirely a matter of taste), note that you can replace the two directives with the single combined construct
#pragma omp parallel for
which in my opinion looks somewhat cleaner.