Why is my C code slower when using OpenMP?
I'm trying to do multi-threaded programming on the CPU using OpenMP. I have lots of for loops which are good candidates for parallelization. I attached part of my code here. When I use the first #pragma omp parallel for reduction, my code is faster, but when I try to use the same directive to parallelize the other loops it gets slower. Does anyone have any idea why?
.
.
.
        omp_set_dynamic(0);
        omp_set_num_threads(4);
        float *h1=new float[nvi];
        float *h2=new float[npi];
        while(tol>0.001)
        {
            std::fill_n(h2, npi, 0);
            int k,i;
            float h222=0;
            #pragma omp parallel for private(i,k) reduction (+: h222)
            for (i=0; i<npi; ++i)
            {
                int p1 = ppi[i];
                int m = frombus[p1];
                for (k=0; k<N; ++k)
                {
                    h222 += v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k])
                                       + B[m-1][k]*sin(del[m-1]-del[k]));
                }
                h2[i] = h222;
            }
            //*********** h3 *****************
            std::fill_n(h3, nqi, 0);
            float h333 = 0;
            #pragma omp parallel for private(i,k) reduction (+: h333)
            for (int i=0; i<nqi; ++i)
            {
                int q1 = qi[i];
                int m = frombus[q1];
                for (int k=0; k<N; ++k)
                {
                    h333 += v[m-1]*v[k]*(G[m-1][k]*sin(del[m-1]-del[k])
                                       - B[m-1][k]*cos(del[m-1]-del[k]));
                }
                h3[i] = h333;
            }
            .
            .
            .
       }
 I don't think your OpenMP code gives the same result as the code without OpenMP.  Let's concentrate on the h2[i] part of the code (the h3[i] part has the same logic).  There is a dependency of h2[i] on the index i (e.g. h2[1] = h2[1] + h2[0], since h222 is never reset between iterations), so the OpenMP reduction you're doing won't give the correct result.  If you want to do the reduction with OpenMP, you need to do it on the inner loop, like this:
float h222 = 0;
for (int i=0; i<npi; ++i) {
    int p1=ppi[i];
    int m = frombus[p1];        
    #pragma omp parallel for reduction(+:h222)
    for (int k=0;k<N; ++k) {
        h222 +=  v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k]) 
                         + B[m-1][k]*sin(del[m-1]-del[k]));
    }
    h2[i] = h222;
}
 However, I don't know if that will be very efficient.  An alternative is to fill h2[i] in parallel over the outer loop without a reduction, and then take care of the dependency serially.  Even though the serial loop is not parallelized, it should add little to the computation time since it does not contain the inner loop over k.  This should give the same result with and without OpenMP and still be fast.
#pragma omp parallel for
for (int i=0; i<npi; ++i) {
    int p1=ppi[i];
    int m = frombus[p1];
    float h222 = 0;
    for (int k=0;k<N; ++k) {
        h222 +=  v[m-1]*v[k]*(G[m-1][k]*cos(del[m-1]-del[k]) 
                         + B[m-1][k]*sin(del[m-1]-del[k]));
    }
    h2[i] = h222;
}
//take care of the dependency serially
for(int i=1; i<npi; i++) {
    h2[i] += h2[i-1];
}    
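If you want to convince yourself the two versions agree, here is a minimal, self-contained sketch of the check; the kernel and sizes are made-up stand-ins for your v/G/B/del/frombus data, not your actual code:
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int npi = 1000, N = 2000;   // made-up sizes

    // Serial reference: h222 keeps accumulating across i, exactly like the original loop.
    std::vector<float> ref(npi);
    float h222 = 0;
    for (int i = 0; i < npi; ++i) {
        for (int k = 0; k < N; ++k)
            h222 += std::sin(0.001f * i) * std::cos(0.001f * k);  // stand-in for the v/G/B/del terms
        ref[i] = h222;
    }

    // Parallel fill of the per-i partial sums, then the serial prefix sum.
    std::vector<float> h2(npi);
    #pragma omp parallel for
    for (int i = 0; i < npi; ++i) {
        float s = 0;
        for (int k = 0; k < N; ++k)
            s += std::sin(0.001f * i) * std::cos(0.001f * k);
        h2[i] = s;
    }
    for (int i = 1; i < npi; ++i)
        h2[i] += h2[i - 1];

    // Compare (floating-point rounding may differ slightly between the two summation orders).
    float maxdiff = 0;
    for (int i = 0; i < npi; ++i)
        maxdiff = std::max(maxdiff, std::fabs(ref[i] - h2[i]));
    printf("max difference: %g\n", maxdiff);
}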
Keep in mind that creating and destroying threads is a time-consuming process; clock the execution time of the process and see for yourself. You only use the parallel reduction twice, which may be faster than a serial reduction, but the initial cost of creating the threads may still be higher. Try parallelizing the outermost loop (if possible) to see if you can obtain a speedup.
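As a rough sketch of how to clock it with the OpenMP runtime (the toy loop and sizes below are made up; substitute the loops you actually want to measure):
#include <cmath>
#include <cstdio>
#include <omp.h>

int main() {
    const int n = 1 << 22;   // made-up workload size
    double sum;

    // Serial version of the toy loop.
    double t0 = omp_get_wtime();
    sum = 0;
    for (int i = 0; i < n; ++i) sum += std::sin(i * 1e-6);
    double serial = omp_get_wtime() - t0;

    // Parallel version: includes the cost of starting the thread team.
    t0 = omp_get_wtime();
    sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) sum += std::sin(i * 1e-6);
    double parallel = omp_get_wtime() - t0;

    printf("serial %.4f s, parallel %.4f s (%d threads), sum %.3f\n",
           serial, parallel, omp_get_max_threads(), sum);
}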
