Performance problems using OpenMP in nested loops

I'm using the following code, which contains an OpenMP parallel for loop nested inside another for loop. Somehow the performance of this code is about 4 times slower than the sequential version (i.e. with #pragma omp parallel for omitted).

Is it possible that OpenMP has to create threads every time the method is called? In my test it is called 10000 times in direct succession.

I heard that OpenMP can keep the threads spinning between parallel regions, so I also tried setting OMP_WAIT_POLICY=active and GOMP_SPINCOUNT=INFINITE. When I remove the OpenMP pragmas, the code is about 10 times faster. Note that the method containing this code will be called 10000 times.

    for (round k = 1; k < processor.max; ++k) {
        initialise_round(k);

        for (std::vector<int> bucket : color_buckets) {
            // a new parallel region is entered for every bucket in every round
            #pragma omp parallel for schedule(dynamic)
            for (int i = 0; i < bucket.size(); ++i) {
                if (processor.mark.is_marked_item(bucket[i])) {
                    processor.process(k, bucket[i]);
                }
            }
        }

        processor.finish_round(k);
    }

You say that your sequential code is much faster, which makes me think that your processor.process function does too little work per call. In that case handing the items out to threads does not pay off: the overhead of distributing the work is simply larger than the actual computation done on each thread.

Other than that, I think that parallelizing the middle loop (over the buckets) instead would not affect the algorithm, but it would increase the amount of work per thread. One way to act on the overhead point is sketched below.
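As an illustration (not part of the original answer), OpenMP's if clause makes the runtime fall back to serial execution when a bucket is too small to amortise the fork/join and scheduling overhead. The threshold of 1000 is a made-up placeholder that would have to be tuned for the real workload:

    // Run the loop in parallel only when the bucket is large enough to be worth it;
    // small buckets are processed serially, avoiding the parallel-region overhead.
    #pragma omp parallel for schedule(dynamic) if(bucket.size() > 1000)
    for (std::size_t i = 0; i < bucket.size(); ++i) {
        if (processor.mark.is_marked_item(bucket[i])) {
            processor.process(k, bucket[i]);
        }
    }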


I think you are creating a team of threads on every iteration of the bucket loop, and therefore in every round. In this case it would probably be better to separate the parallel from the for, so the work of forking and creating the threads is done just once rather than being repeated in the inner loops: put a parallel pragma before your outermost loop and keep only the for worksharing pragma on the innermost loop, roughly as sketched below.
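A minimal sketch of that suggestion, assuming that initialise_round and finish_round must run serially (one thread executes them inside single while the others wait at the implicit barrier) and that processor.process is safe to call concurrently:

    #pragma omp parallel            // the thread team is created once, here
    for (round k = 1; k < processor.max; ++k) {
        #pragma omp single
        initialise_round(k);        // serial part of each round

        for (const std::vector<int>& bucket : color_buckets) {
            #pragma omp for schedule(dynamic)
            for (std::size_t i = 0; i < bucket.size(); ++i) {
                if (processor.mark.is_marked_item(bucket[i])) {
                    processor.process(k, bucket[i]);
                }
            }
            // implicit barrier: the whole bucket is done before the next one starts
        }

        #pragma omp single
        processor.finish_round(k);  // serial part of each round
    }

All threads execute the round and bucket loops redundantly (the loop bounds are the same for every thread), but the item processing is shared out by the for construct, so the iterations are distributed exactly as before without re-creating the thread team.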


The actual problem was not related to OpenMP directly.

As the system has two CPUs, half of the threads were spawned on one socket and the other half on the other, so the threads did not share an L3 cache. Combined with the fact that the algorithm does not scale well, this led to a performance decrease, especially when using 2-4 threads.

The solution was to pin the threads to one CPU, for example with the Linux tool taskset:
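For example (the core numbering and the program name here are hypothetical; the machine's actual topology can be checked with lscpu), restricting the process and all of its OpenMP threads to the cores of the first CPU keeps them on a shared L3 cache:

    # pin the program to cores 0-3, assumed to belong to one socket on this machine
    taskset -c 0-3 ./my_program

Newer OpenMP runtimes can achieve a similar effect without an external tool via the OMP_PROC_BIND and OMP_PLACES environment variables.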
