OpenMP slow private function
I want to use OpenMP to parallelize a for-loop inside a function that is called from main in C++.
My code runs much slower with OpenMP than in sequential mode: the for-loop takes about 6.1 s (wall-clock) without OpenMP (just commenting out the #pragma command) and 11.8 s with OpenMP.
My machine has 8 CPUs and 8183 MB of physical memory and runs a 64-bit Windows 7 operating system. I use the Visual Studio compiler targeting x64, in debug mode.
I have read that performance degradation might be due to variables that should be declared as private, but I am unsure how to do this correctly and which of the variables need to be declared as private.
This is the relevant for-loop:
vec DecenterOffsetParallel(const real_1d_array &x22, vec vDistance, double dOffsetXp, double dOffsetYp, double dOffsetXm, double dOffsetYm, double dOffsetXpY, double dOffsetYpX, double dOffsetXmY, double dOffsetYmX, double* dDeltaXp, double* dDeltaYp, double* dDeltaXm, double* dDeltaYm, double* dDeltaXpY, double* dDeltaYpX, double* dDeltaXmY, double* dDeltaYmX, double* delta0, /*local variables for the parallel code: */ const int nRaysn, double PupilDian, mat mRxNn, mat mRyNn, mat mRzNn, mat mRxN1n, mat mRyN1n, mat mRzN1n, mat mRxN2n, mat mRyN2n, mat mRzN2n, mat mRxN3n, mat mRyN3n, mat mRzN3n, mat mRcxNn, mat mRcyNn, mat mRczNn, mat mRcxN1n, mat mRcyN1n, mat mRczN1n, mat mRcxN2n, mat mRcyN2n, mat mRczN2n, mat mRcxN3n, mat mRcyN3n, mat mRczN3n, mat mPathNn, mat mPath1Nn, mat mPath00Nn, mat mPathN1n, mat mPath1N1n, mat mPath00N1n, mat mPathN2n, mat mPath1N2n, mat mPath00N2n, mat mPathN3n, mat mPath1N3n, mat mPath00N3n)
{
    #pragma omp parallel for
    for (int xy = 0; xy < nRaysn * nRaysn; xy++){
        mat temp = mat(nRaysn, nRaysn);
        mat mScxn(nRaysn, nRaysn);
        mat mScyn(nRaysn, nRaysn);
        mat mSczn(nRaysn, nRaysn);
        int i = xy / nRaysn;
        int j = xy % nRaysn;
        // only rays inside entrance pupil:
        if (sqrt(((10.0 / nRaysn) * i - 5.0)*((10.0 / nRaysn) * i - 5.0) + ((10.0 / nRaysn)*j - 5.0) *((10.0 / nRaysn)*j - 5.0)) <= PupilDian / 2.0){
            // Initialize the matrices
            mRxNn(i, j) = (10.0 / nRaysn) * i - 5.0;
            mRyNn(i, j) = (10.0 / nRaysn) * j - 5.0;
            mRzNn(i, j) = 0.0;
            //... everything is repeated 3 more times to simulate all in all 4 cases...: mRxNn1(i,j) = (10.0/nRaysn)*i-5.0; and so on...
            mRcxNn(i, j) = sign(vDistance(0)) *(mRxNn(i, j) - dOffsetYmX) / (sqrt(vDistance(0)*vDistance(0) + (mRxNn(i, j) - dOffsetYmX) * (mRxNn(i, j) - dOffsetYmX) + (mRyNn(i, j) - dOffsetYm) *(mRyNn(i, j) - dOffsetYm)));
            mRcyNn(i, j) = sign(vDistance(0)) *(mRyNn(i, j) - dOffsetYm) / (sqrt(vDistance(0)*vDistance(0) + (mRxNn(i, j) - dOffsetYmX) * (mRxNn(i, j) - dOffsetYmX) + (mRyNn(i, j) - dOffsetYm) *(mRyNn(i, j) - dOffsetYm)));
            mRczNn(i, j) = sqrt(1 - mRcxNn(i, j)*mRcxNn(i, j) - mRcyNn(i, j)*mRcyNn(i, j));
            mPathNn(i, j) = 0.0;
            mPath1Nn(i, j) = sign(vDistance(0)) *nAir * vDistance(0) / mRczNn(i, j);
            mPath00Nn(i, j) = mPath1Nn(i, j);
            //... everything is repeated 3 more times to simulate 4 different cases...
            // trace rays through cornea
            temp(i, j) = RayIntersect(ZernAnt, ZernRadAnt, &mRxNn(i, j), &mRyNn(i, j), P2DAnt, UAnt, VAnt, &mRzNn(i, j), mRcxNn(i, j), mRcyNn(i, j), mRczNn(i, j), &mPathNn(i, j), xNullAnt, yNullAnt, NknotsUAnt, NknotsVAnt); // find the intersection (modifies mRz, mRy, mRx, mPath)
            mPathNn(i, j) = mPath1Nn(i, j) + nAir*mPathNn(i, j);
            temp(i, j) = Surface(P2DAnt, UAnt, VAnt, ZernAnt, ZernRadAnt, mRxNn(i, j), mRyNn(i, j), mRzNn(i, j), &mScxn(i, j), &mScyn(i, j), &mSczn(i, j), KnotIntervallSizeAnt, xNullAnt, yNullAnt);
            // *Ant are identical for all four cases!
            temp(i, j) = Refract(nAir, nCornea, &mRcxNn(i, j), &mRcyNn(i, j), &mRczNn(i, j), mScxn(i, j), mScyn(i, j), mSczn(i, j));
            //... everything is repeated 3 more times to simulate all in all 4 cases...
        }
        else{
            mRxNn(i, j) = mRyNn(i, j) = mRzNn(i, j) = mRcxNn(i, j) = mRcyNn(i, j) = mRczNn(i, j) = mPathNn(i, j) = mPath1Nn(i, j) = NAN;
            //... everything is repeated 3 more times to simulate all in all 4 cases...
        }
    }
    // some other stuff, that is not relevant to the question...
}
Can anyone give me a hint as to what might cause the performance degradation? Thank you!
PS: The Armadillo library is used for the matrices and vectors.
You don't need to declare any private variables in the parallel construct, because all the variables are either read-only (nS, Dia, etc.) or need to be shared (mX, mY, etc.).
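To illustrate the point (a minimal sketch with made-up names, not the question's actual code): variables declared inside the loop body, such as temp, mScxn, i, and j above, are automatically private to each iteration, so only a variable declared outside the loop and written by every iteration would need an explicit private/firstprivate clause.

#include <vector>

// Illustrative only: scaleAll, scale, and tmp are placeholder names.
void scaleAll(int n, const std::vector<double>& in, std::vector<double>& out)
{
    double scale = 2.0;                  // read-only inside the loop: fine to leave shared
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double tmp = scale * in[i];      // declared inside the loop body: automatically private
        out[i] = tmp;                    // each iteration writes a distinct element: safe to share
    }

    // Only a variable declared *outside* the loop and overwritten by every
    // iteration would need an explicit clause, e.g.:
    //
    //   double work;
    //   #pragma omp parallel for private(work)
    //   for (int i = 0; i < n; ++i) { work = in[i] * in[i]; out[i] = work; }
}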
Regarding the question about performance degradation: you should provide more information, as @Zulan explained in the comments. A very useful metric here would be how many times your function gets called and which portion of the application's total time it takes (to check that this is actually a hotspot). Having both the application execution time and the accumulated function time would be great.
You can do this with many tools, such as gprof's call graph, but run it without OpenMP.
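If you prefer a quick manual measurement instead of a profiler, something like the following gives you both numbers (a rough sketch; Hotspot and the counters are placeholder names standing in for your function, not your actual code):

#include <cstdio>
#include <omp.h>

static long long g_calls = 0;        // how often the candidate hotspot runs
static double g_functionTime = 0.0;  // wall-clock time accumulated inside it

void Hotspot()                       // stands in for the function under investigation
{
    double t0 = omp_get_wtime();
    // ... body of the function being measured ...
    g_functionTime += omp_get_wtime() - t0;
    ++g_calls;
}

int main()
{
    double tApp = omp_get_wtime();
    for (int k = 0; k < 100; ++k)    // stand-in for the real driver loop
        Hotspot();
    tApp = omp_get_wtime() - tApp;
    std::printf("calls: %lld, time in function: %.2f s, total application time: %.2f s\n",
                g_calls, g_functionTime, tApp);
    return 0;
}

If the function accounts for only a small fraction of the total application time, parallelizing it (or tuning the parallelization) will not help much, regardless of data-sharing clauses.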