Altera OpenCL parallel execution in FPGA

I have been looking into Altera OpenCL for a little while, with the aim of speeding up computation-heavy programs by moving the computation to an FPGA. I managed to run the vector addition example provided by Altera, and it seems to work fine. From the Altera OpenCL documentation I learned that it uses pipelined parallelism to improve performance.

I was wondering whether Altera OpenCL on an FPGA can achieve parallel execution similar to multiple VHDL processes running in parallel, for example by launching multiple kernels on one device that execute at the same time. Is this possible? How do I check whether it is supported? Any help would be appreciated.

Thanks!


The quick answer is YES.

According to the Altera OpenCL guides, there are generally two ways to achieve this:

1/ SIMD for vectorised data load/store

2/ replicate the compute resources on the device

For 1/, use the num_simd_work_items and reqd_work_group_size kernel attributes; multiple work-items from the same work-group will then run at the same time.

For 2/, use the num_compute_units kernel attribute; multiple work-groups will then run at the same time.

Develop a single work-item kernel first, then use 1/ to improve its performance; 2/ should generally be considered last.

By doing both 1/ and 2/, multiple work-groups, each with multiple work-items, run at the same time on the FPGA device.
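As a rough sketch of how 1/ and 2/ combine, the vector addition kernel could be annotated as below. The kernel name vec_add, its arguments, and the values 64, 4 and 2 are assumptions chosen for illustration, not taken from Altera's example. As far as I recall from the guide, num_simd_work_items must be a power of two that evenly divides the required work-group size. The block at the top is only a host-side shim (also an assumption of this sketch) so that the kernel body can be compiled as plain C for a quick check; a host compiler will at most warn about the attributes it does not know.

```c
/* Host-only shim (illustrative assumption): lets this file build as plain C.
   An OpenCL compiler defines __OPENCL_VERSION__ and therefore skips it. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
static size_t emu_gid;  /* hypothetical stand-in for the work-item index */
static size_t get_global_id(unsigned dim) { (void)dim; return emu_gid; }
#endif

/* 1/ SIMD: four work-items of a 64-item work-group are processed together. */
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((num_simd_work_items(4)))
/* 2/ Replication: two copies of the whole kernel pipeline. */
__attribute__((num_compute_units(2)))
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *sum)
{
    size_t gid = get_global_id(0);  /* one element per work-item */
    sum[gid] = a[gid] + b[gid];
}
```

On the host, the NDRange would then have to be launched with a local work size of 64 so that it matches reqd_work_group_size.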

Note: Depending on the nature of the problem you are solving, the above optimizations may not always be suitable.


If you're talking about replicating the kernel more than once, you can increase the number of compute units. There is an attribute that you can add before the kernel:

__attribute__((num_compute_units(N)))
__kernel void test(...){
    ...
}

By doing this you essentially replicate the kernel N times. However, the programming guide recommends that you first look into the SIMD attribute (num_simd_work_items), which performs the same operation over multiple data items; this makes access to global memory more efficient. If you increase the number of compute units and your kernels access global memory, there can be contention, because the compute units compete for access to global memory.

You can also replicate operations at a fine-grained level by using loop unrolling. For example,

#pragma unroll N
for(short i = 0; i < N; i++)
    sum[i] = a[i] + b[i];

This creates hardware for N additions, so the element-wise sum of the two vectors is computed for N elements in one go. If an iteration depends on data from the previous iteration, the unrolled operations cannot all run at once; they are chained one after another in the pipeline instead.
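To make the difference concrete, here is a minimal sketch of the two cases. It is written as plain C functions so it can be checked on a host (a host compiler may simply ignore #pragma unroll); in a real design the loops would sit inside a __kernel function. The function names and the fixed trip count of 4 are assumptions for illustration.

```c
/* Independent iterations: unrolling by 4 asks the compiler for four adders
   that can all work at once, one per element. */
void add4(const float *a, const float *b, float *sum)
{
    #pragma unroll 4
    for (int i = 0; i < 4; i++)
        sum[i] = a[i] + b[i];
}

/* Loop-carried dependency: every addition needs the previous value of acc,
   so the unrolled adders feed one another instead of running side by side
   (unless the compiler is allowed to reorder the additions). */
float sum4(const float *a)
{
    float acc = 0.0f;
    #pragma unroll 4
    for (int i = 0; i < 4; i++)
        acc += a[i];
    return acc;
}
```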

On the other hand, if your goal is to launch different kernels that perform different operations, you can do that by defining all of them in one OpenCL file. When you compile the file, the kernels are mapped, placed and routed into the FPGA together. Afterwards, you just need to invoke each kernel from your host by calling clEnqueueNDRangeKernel or clEnqueueTask. The kernels can then run side by side in parallel once the commands are enqueued. Note that an in-order command queue executes its commands one after another, so enqueue each kernel on its own command queue if you want them to overlap.
