OpenMP latency for a nested for loop

2018-06-28 08:44:20

I have a piece of code that I want to parallelize, but the OpenMP version is much slower than the serial version. What is wrong with my implementation? This is the main program:

#include <iostream>
#include <gsl/gsl_math.h>
#include "Chain.h"
using namespace std;

int main(){
  int const N=1000;
  int timeSteps=100;
  double delta=0.0001;
  double qq[N];
  Chain ch(N);
  ch.initCond();
  for (int t=0; t<timeSteps; t++){
    ch.changeQ(delta*t);
    ch.calMag_i();
    ch.calForce001();
  }
  ch.printSomething();
}

The Chain.h is

class Chain{
  public:
    int N;
    double *q;
    double *mx;
    double *my;
    double *force;

    Chain(int const Np);
    void initCond();
    void changeQ(double delta);
    void calMag_i();
    void calForce001();
};

And the Chain.cpp is

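#include <cmath>     // cos, sin
#include <iostream>  // cout, endl
#include <omp.h>     // omp_get_wtime
#include "Chain.h"
using namespace std;

Chain::Chain(int const Np){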
  this->N     = Np;
  this->q     = new double[Np];
  this->mx    = new double[Np];
  this->my    = new double[Np];
  this->force = new double[Np];  
}

void Chain::initCond(){
  for (int i=0; i<N; i++){
    q[i]     = 0.0;
    force[i] = 0.0;
  }
}

void Chain::changeQ(double delta){
  int i=0;
  #pragma omp parallel
  {
    #pragma omp for
    for (int i=0; i<N; i++){
      q[i] = q[i] + delta*i + 1.0*i/N;
    }
  }
}

void Chain::calMag_i(){
  int i =0;
  #pragma omp parallel
  {
    #pragma omp for
    for (i=0; i<N; i++){
      mx[i] = cos(q[i]);
      my[i] = sin(q[i]);
    }
  }
}

void Chain::calForce001(){
  int i;
  int j;
  double fij =0.0;
  double start_time = omp_get_wtime();
  #pragma omp parallel
  {
    #pragma omp for private(j, fij)
    for (i=0; i<N; i++){
      force[i] = 0.0;
      for (j=0; j<i; j++){
        fij = my[i]*mx[j] - mx[i]*my[j];
        #pragma omp critical
        {
          force[i] +=  fij;
          force[j] += -fij;
        }
      }
    }
  }
  double time = omp_get_wtime() - start_time;
  cout <<"time = " << time <<endl;
}

So the methods changeQ() and calMag_i() are in fact faster than the serial code; my problem is calForce001(). The execution times are:

with openMP 3.939s

without openMP 0.217s

Now, clearly I'm doing something wrong, or the code can't be parallelized. Any help would be useful. Thanks in advance. Carlos

Edit: To clarify the question, I added calls to omp_get_wtime() to measure the execution time of calForce001(). The times for one execution are:

with omp: 0.0376656 s

without omp: 0.00196766 s

So the omp version is about 20 times slower.

I also measured the time for the calMag_i() method:

with omp: 3.3845e-05 s

without omp: 9.9516e-05 s

For this method, omp is about 3 times faster.

I hope this confirms that the latency problem is in the calForce001() method.

There are three reasons why you don't benefit from any speedup.

First, you have #pragma omp parallel all over your code. This pragma starts a "team of threads", and at the end of the block the team is disbanded, which is quite costly. Removing those blocks and using #pragma omp parallel for instead of #pragma omp for starts the team on first encounter and puts it to sleep after each block. This made the application 4x faster for me.
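As a rough sketch of that change, the two cheap loops could be written with the combined directive as shown here. The free functions changeQ and calMag and the use of std::vector are assumptions made to keep the example self-contained; they are not the Chain class from the question. Without OpenMP enabled at compile time, the pragmas are ignored and the loops simply run serially.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical free-function counterpart of Chain::changeQ.
void changeQ(std::vector<double>& q, double delta) {
    const int n = static_cast<int>(q.size());
    // One combined directive instead of a parallel block wrapping an omp for.
    // i is declared in the loop header, so each thread gets its own copy.
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        q[i] = q[i] + delta * i + 1.0 * i / n;
    }
}

// Hypothetical free-function counterpart of Chain::calMag_i.
void calMag(const std::vector<double>& q,
            std::vector<double>& mx,
            std::vector<double>& my) {
    assert(mx.size() == q.size() && my.size() == q.size());
    const int n = static_cast<int>(q.size());
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        mx[i] = std::cos(q[i]);
        my[i] = std::sin(q[i]);
    }
}
```

Each iteration touches only its own element, so these loops need no synchronization at all.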

Second, you use #pragma omp critical. On most platforms this forces the use of a mutex, which is heavily contended because all threads want to update the force array at the same time. So don't use a critical section here. You could use atomic updates, but in this case that won't make much of a difference (see the third point). Just removing the critical section improved the speed by another 3x, although without any protection the concurrent updates to force[j] race with each other, so that timing comes at the cost of correctness.
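For reference, the atomic variant mentioned above might look like the following. This is only a sketch: calForceAtomic is a hypothetical free function working on std::vector, not the original member function. It also zeroes the array before the parallel loop, because resetting force[i] inside the loop can discard contributions that another thread has already added.

```cpp
#include <cassert>
#include <vector>

// Hypothetical free-function counterpart of Chain::calForce001,
// with atomic updates in place of the critical section.
void calForceAtomic(const std::vector<double>& mx,
                    const std::vector<double>& my,
                    std::vector<double>& force) {
    assert(my.size() == mx.size() && force.size() == mx.size());
    const int n = static_cast<int>(mx.size());

    // Zero everything up front, outside the parallel loop.
    for (int i = 0; i < n; i++) {
        force[i] = 0.0;
    }

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < i; j++) {
            // Declared inside the loop, so no private clause is needed.
            const double fij = my[i] * mx[j] - mx[i] * my[j];
            // Each atomic protects exactly one read-modify-write.
            #pragma omp atomic
            force[i] += fij;
            #pragma omp atomic
            force[j] -= fij;
        }
    }
}
```

This produces correct sums, but every inner iteration still performs two synchronized updates, so it should not be expected to beat the serial loop at this problem size.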

Third, parallelism only pays off when there is enough work. Everything in your code is too small to benefit: there is too little work to win back the time lost on starting, waking, and destroying the threads. If the workload were ten times larger, some of the parallel for statements would make sense. Chain::calForce001() in particular will never be worth it if you have to do atomic updates.
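One way to avoid atomic updates altogether, offered here as an alternative that is not part of the original answer, is to let each iteration compute only its own force[i]. The pair (i, j) with j < i contributes -fij to force[j], which is the same expression my[j]*mx[i] - mx[j]*my[i] evaluated from j's side, so force[i] is the sum of that expression over all j != i. The sketch below assumes a hypothetical function calForceNoSync and a hypothetical threshold kMinParallelN used with the OpenMP if clause to keep small problems serial. It does twice the arithmetic of the original loop, so whether it is actually faster has to be measured.

```cpp
#include <cassert>
#include <vector>

// Hypothetical cutoff below which the loop runs serially; tune by measuring.
constexpr int kMinParallelN = 2000;

// Hypothetical restructuring of Chain::calForce001: every iteration
// writes only force[i], so no atomic or critical section is required.
void calForceNoSync(const std::vector<double>& mx,
                    const std::vector<double>& my,
                    std::vector<double>& force) {
    assert(my.size() == mx.size() && force.size() == mx.size());
    const int n = static_cast<int>(mx.size());

    // The if clause disables the thread team when n is small.
    #pragma omp parallel for if(n >= kMinParallelN)
    for (int i = 0; i < n; i++) {
        double sum = 0.0;  // private to this iteration
        for (int j = 0; j < n; j++) {
            if (j != i) {
                sum += my[i] * mx[j] - mx[i] * my[j];
            }
        }
        force[i] = sum;
    }
}
```

Because the summation order differs from the original pairwise loop, results can differ in the last few bits for general floating-point inputs.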

With respect to programming style: you're programming in C++, so please use locally scoped variables wherever you can. For example, in Chain::calForce001(), declare double fij inside the inner loop. That saves you from having to write private clauses, and compilers are smart enough to optimize it. Correct scoping also allows for better optimizations.
