Why my C code is slower using OpenMP

I'm trying to do multi-threaded programming on the CPU using OpenMP. I have lots of for loops that are good candidates for parallelization; part of my code is attached below. When I use the first #pragma omp parallel for reduction, my code gets faster, but when I apply the same directive to parallelize the other loops it gets slower. Does anyone have any idea why that is?

.
.
.

        omp_set_dynamic(0);
        omp_set_num_threads(4);

        float *h1=new float[nvi];
        float *h2=new float[npi];

        while(tol>0.001)
        {
            std::fill_n(h2, npi, 0);
            int k,i;
            float h222=0;
            #pragma omp parallel for private(i,k) reduction(+: h222)
            for (i = 0; i < npi; ++i)
            {
                int p1 = ppi[i];
                int m = frombus[p1];
                for (k = 0; k < N; ++k)
                {
                    h222 += v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k])
                                       + B[m-1][k]*sin(del[m-1]-del[k]));
                }
                h2[i] = h222;
            }

            //*********** h3*****************

            std::fill_n(h3, nqi, 0);
            float h333=0;

            #pragma omp parallel for private(i,k) reduction(+: h333)
            for (int i = 0; i < nqi; ++i)
            {
                int q1 = qi[i];
                int m = frombus[q1];
                for (int k = 0; k < N; ++k)
                {
                    h333 += v[m-1]*v[k]*(G[m-1][k]*sin(del[m-1]-del[k])
                                       - B[m-1][k]*cos(del[m-1]-del[k]));
                }
                h3[i] = h333;
            }
            .
            .
            .
       }

I don't think your OpenMP code gives the same result as the code without OpenMP. Let's concentrate on the h2[i] part of the code (the h3[i] part has the same logic). There is a dependency of h2[i] on the previous index: because h222 is never reset inside the outer loop, the serial code effectively computes h2[i] = h2[i-1] + (the sum over k for iteration i), e.g. h2[1] includes h2[0]. The OpenMP reduction you're doing won't give the correct result. If you want to do the reduction with OpenMP, you need to do it on the inner loop, like this:

float h222 = 0;
for (int i=0; i<npi; ++i) {
    int p1=ppi[i];
    int m = frombus[p1];        
    #pragma omp parallel for reduction(+:h222)
    for (int k=0;k<N; ++k) {
        h222 +=  v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k]) 
                         + B[m-1][k]*sin(del[m-1]-del[k]));
    }
    h2[i] = h222;
}

However, I don't know how efficient that will be. An alternative method is to fill h2[i] in parallel over the outer loop without a reduction and then take care of the dependency serially. Even though the final loop is not parallelized, it should have only a small effect on the computation time, since it does not contain the inner loop over k. This should give the same result with and without OpenMP and still be fast.

#pragma omp parallel for
for (int i=0; i<npi; ++i) {
    int p1=ppi[i];
    int m = frombus[p1];
    float h222 = 0;
    for (int k=0;k<N; ++k) {
        h222 +=  v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k]) 
                         + B[m-1][k]*sin(del[m-1]-del[k]));
    }
    h2[i] = h222;
}
//take care of the dependency serially
for(int i=1; i<npi; i++) {
    h2[i] += h2[i-1];
}    

Keep in mind that creating and destroying threads is a time-consuming process; clock the execution time of the parallel regions and see for yourself. You only use the parallel reduction twice, which may be faster than a serial reduction, but the initial cost of creating the threads may still be higher. Try parallelizing the outermost loop (if possible) to see whether you can obtain a speedup.
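Below is a minimal sketch (not your original program) of how you could measure this and amortize the thread-creation cost: the iteration count, array size, and per-element work are placeholders I made up, omp_get_wtime() times the whole region, and the parallel region is opened once so the same thread team is reused by every work-sharing loop inside it.

#include <omp.h>
#include <cstdio>
#include <vector>

int main() {
    const int iters = 100;          // stand-in for the while (tol > 0.001) loop
    const int n     = 100000;       // stand-in for npi
    std::vector<float> h2(n, 0.0f);

    double t0 = omp_get_wtime();

    #pragma omp parallel            // threads are created once, here
    for (int it = 0; it < iters; ++it) {
        #pragma omp for             // work-sharing loop reuses the existing team
        for (int i = 0; i < n; ++i) {
            h2[i] = 0.5f * i;       // placeholder for the real per-element work
        }                           // implicit barrier keeps the threads in step
    }

    double t1 = omp_get_wtime();
    std::printf("elapsed: %f s\n", t1 - t0);
    return 0;
}

Note that in your real code the outer loop condition depends on tol, so with this pattern tol would have to be updated inside a single construct so every thread sees the same value at the loop test; if that gets awkward, it is simpler to keep the parallel regions inside the while loop and just compare the measured times.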
