OpenMP: sharing arrays between threads

Good day everyone!

I'm running a molecular dynamics simulation, and recently I began trying to parallelise it. At first sight everything looked simple enough: write a #pragma omp parallel for directive in front of the most time-consuming loops. But as it happens, the functions in those loops operate on arrays, or, to be precise, on arrays that belong to an object of my class, which holds all the information about the particle system and the functions acting on it. So when I added that #pragma directive before one of the most time-consuming loops, the computation time actually increased several times, even though my 2-core, 4-thread processor was fully loaded.

To sort this out I wrote another, simpler program. This test program runs two identical loops, the first in parallel and the second in serial, and measures the time it takes to execute each of them. The results surprised me: whenever the first loop was computed in parallel, its computation time decreased compared to serial mode (1500 ms against 6000 ms), but the computation time of the second loop increased drastically (15,000 ms against 6000 ms in serial).

I tried to use the private() and firstprivate() clauses, but the results were the same. Shouldn't every variable defined and initialised before the parallel region be shared automatically anyway? The computation time of the second loop goes back to normal if it operates on another vector, vec2, but creating a new vector for every iteration is clearly not an option. I also tried putting the actual update of vec1 into a #pragma omp critical section, but that wasn't any good either. Nor did adding a shared(vec1) clause help.

I would appreciate it if you could point out my errors and show me the proper way.

Also, is it necessary to put that private(i) clause into the code?

Here is the test program:

#include "stdafx.h"
#include <omp.h>
#include <array>
#include <time.h>
#include <vector>
#include <iostream>
#include <Windows.h>
using namespace std;
#define N1  1000
#define N2  4000
#define dim 1000

int main(){
    vector<int>res1,res2;
    vector<double>vec1(dim),vec2(N1);
    clock_t t, tt;
    int k=0;
    for( k = 0; k<dim; k++){
        vec1[k]=1;
    }

    t = clock();

    // First loop: executed in parallel
    #pragma omp parallel
    {
        double temp = 0;   // accumulator; must start at zero
        int i,j,k;
        #pragma omp for private(i)
        for( i = 0; i<N1; i++){
            for( j = 0; j<N2; j++){
                for( k = 0; k<dim; k++){
                    temp+= j;
                }
            }
            vec1[i]+=temp;
            temp = 0;
        }
    }
    tt = clock();
    cout<<tt-t<<endl;
    // Reset vec1 before the serial run
    for(int k = 0; k<dim; k++){
        vec1[k]=1;
    }
    t = clock();
    // Second loop: the same computation, executed serially
    for(int g = 0; g<N1; g++){
        for(int h = 0; h<N2; h++){
            for(int y = 0; y<dim; y++){
                vec1[g]+=h;
            }
        }
    }
    tt = clock();
    cout<<tt-t<<endl;
    getchar();
}

Thank you for your time!

PS: I use Visual Studio 2012; my processor is an Intel Core i3-2370M. My assembly output, in two parts:

http://pastebin.com/suXn35xj

http://pastebin.com/EJAVabhF


Congratulations! You have exposed yet another bad OpenMP implementation, courtesy of Microsoft. My initial theory was that the problem comes from the partitioned L3 cache in Sandy Bridge and later Intel CPUs, but the result from running the second loop on only the first half of the vector did not confirm that theory. So it has to be something in the code generator that is triggered when OpenMP is enabled, and the assembly output confirms this.

Basically, the compiler does not optimise the serial loop when compiling with OpenMP enabled. That's where the slowdown comes from. Part of the problem was also introduced by yourself, by making the second loop not identical to the first one. In the first loop you accumulate intermediate values into a temporary variable, which the compiler optimises into a register variable, while in the second loop you invoke operator[] on every iteration. When you compile without OpenMP enabled, the code optimiser transforms the second loop into something quite similar to the first one, hence you get almost the same run time for both loops.
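
For illustration, this is roughly the transformation the optimiser applies to the serial loop: accumulate into a local variable, which can live in a register, and write back to vec1 only once per outer iteration. This is just a sketch of the idea, not the exact code the compiler emits:

// Sketch: the serial loop rewritten by hand to accumulate into a
// local variable, mimicking what the optimiser normally does itself
for (int g = 0; g < N1; g++) {
    double temp = 0;                 // kept in a register
    for (int h = 0; h < N2; h++) {
        for (int y = 0; y < dim; y++) {
            temp += h;               // no operator[] call per iteration
        }
    }
    vec1[g] += temp;                 // single write-back per outer pass
}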

When you enable OpenMP, the code optimiser does not optimise the second loop and it runs way slower. The fact that your code executes a parallel block before it has nothing to do with the slowdown. My guess is that the code optimiser is unable to grasp the fact that vec1 is used outside the scope of the OpenMP parallel region, and hence no longer has to be treated as a shared variable, so the loop could be optimised. Obviously this is a "feature" that was introduced in Visual Studio 2012, since the code generator in Visual Studio 2010 is able to optimise the second loop even with OpenMP enabled.

One possible solution would be to migrate to Visual Studio 2010. Another (hypothetical, since I don't have VS2012 at hand) solution would be to extract the second loop into a separate function and pass the vector to it by reference. Hopefully the compiler would be smart enough to optimise the code in the separate function.
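
Something along these lines (a sketch of the idea only; the function name and structure are mine, and I haven't been able to verify it against VS2012):

// Hypothetical helper: the serial benchmark loop moved out of main(),
// so that the optimiser only sees a plain by-reference parameter and
// no OpenMP constructs in the enclosing function
void serial_loop(std::vector<double>& v)
{
    for (int g = 0; g < N1; g++) {
        for (int h = 0; h < N2; h++) {
            for (int y = 0; y < dim; y++) {
                v[g] += h;
            }
        }
    }
}

Declare it before main() and simply call serial_loop(vec1); between the two clock() calls.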

This is a very bad trend. Microsoft have practically given up on supporting OpenMP in Visual C++. Their implementation still (almost) conforms to OpenMP 2.0 only, hence no explicit tasks and other OpenMP 3.0+ goodies, and bugs like this one do not make things any better. I would recommend that you either switch to another OpenMP-enabled compiler (Intel C/C++ Compiler, GCC, anything non-Microsoft) or switch to some compiler-independent threading paradigm, for example Intel Threading Building Blocks. Microsoft is clearly pushing its parallel library for .NET, and that's where all the development effort goes.


Big Fat Warning

Do not use clock() to measure the elapsed wall-clock time! It only works as expected on Windows. On most Unix systems (including Linux), clock() actually returns the total CPU time consumed by all threads in the process since it was created. This means that clock() may return values that are either several times larger than the elapsed wall-clock time (if the program runs with many busy threads) or several times shorter than the wall-clock time (if the program sleeps or waits on I/O events between the measurements). Instead, OpenMP programs should use the portable timer function omp_get_wtime().
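
For example, the timing in your test program could be rewritten roughly like this (a sketch; omp_get_wtime() returns elapsed seconds as a double):

double t0 = omp_get_wtime();
// ... loop being measured ...
double t1 = omp_get_wtime();
cout << (t1 - t0) * 1000.0 << " ms" << endl;   // wall-clock time in milliseconds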
