Nested OpenMP loops
I have a piece of code in the following style:
for (set=0; set < n; set++) //For1
{
#pragma omp parallel for num_threads(x)
for (i=0; i < m; i++) //For2: this loop can be executed in parallel
{
commands...
}
for (j=0; j < m; j++) //For3: this loop depends on the output of For2 and must run sequentially
{
commands...
}
}
As you can see, I have n independent sets (the outer loop, For1). Each set consists of a parallel loop (For2) and a sequential section (For3) that must run after For2.
I already use "#pragma omp parallel for num_threads(x)" on For2 to make it parallel.
Now I want to make the outer loop (For1) parallel as well. In other words, I want to run the sets in parallel.
I would really appreciate it if you could let me know how this is possible in OpenMP.
One way might be to create n threads, one per set. Is that correct? Or is there a way to do it entirely with OpenMP features?
Thanks in advance.
You can do this simply by parallelizing the outer loop:
#pragma omp parallel for num_threads(x) private(i,j)
for (set=0; set < n; set++) //For1
{
for (i=0; i < m; i++) //For2: this loop can be executed in parallel
{
commands...
}
for (j=0; j < m; j++) //For3: this loop depends on the output of For2 and must run sequentially
{
commands...
}
}
You can try fusing the first and second loops (see below). I don't know whether it will be faster, but it's worth a try.
#pragma omp parallel num_threads(x) private(set, i)
{
#pragma omp for schedule(static)
for (k = 0; k < n*m; k++) //fused For1 and For2
{
set = k/m;
i = k%m;
//commands...
}
#pragma omp for schedule(static)
for (set = 0; set < n; set++)
{
for (i = 0; i < m; i++) //For3 - j is not necessary so reuse i
{
//commands...
}
}
}
Simply parallelizing the outer loop could prove to be the best option for you, depending on how many sets you have. If you have more sets than cores on your machine, it could be faster than parallelizing the inner loop, since there is much less thread-management overhead in that case.
Assuming your operations are CPU-bound, parallelizing the outer loop will fully use all the cores on your machine. Further parallelizing the inner loop will not make it faster once all resources are already in use.
If you have fewer sets than available cores, parallelize the inner loop instead; that will most likely already consume all the available computing power.
If you really want to parallelize both loops, consider MPI and hybrid parallelization across several machines: the outer loop distributed over the machines, and the inner loop parallelized with OpenMP over all the cores of each machine.