CPU/Intel OpenCL performance issues, implementation questions

I have had some questions hanging in the air without an answer for a few days now. They arose because I have an OpenMP and an OpenCL implementation of the same problem. The OpenCL version runs perfectly on the GPU, but when run on the CPU it achieves about 50% less performance than the OpenMP implementation. An existing post already deals with the difference between OpenMP and OpenCL performance, but it doesn't answer my questions. At the moment I face these questions:

1) Is it really that important to have a "vectorized kernel" (in the sense of the Intel Offline Compiler)?

There is a similar post, but I think my question is more general.

As I understand it: a vectorized kernel does not necessarily mean that there are no vector/SIMD instructions in the compiled binary. I checked the assembly code of my kernels, and there is a bunch of SIMD instructions. Rather, a vectorized kernel means that by using SIMD instructions you can execute 4 (SSE) or 8 (AVX) OpenCL "logical" threads in one CPU thread. This can only be achieved if ALL your data is stored consecutively in memory. But who has such perfectly sorted data?

So my question would be: Is it really that important to have your kernel "vectorized" in this sense?

Of course it gives a performance improvement, but if the most computation-intensive parts of the kernel are already done with vector instructions, you might get near-"optimal" performance anyway. I think the answer to my question lies in memory bandwidth: vector registers probably fit efficient memory access better, and in that case the kernel arguments (pointers) have to be vectorized as well.
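To illustrate what I mean (a minimal sketch, the kernel names are hypothetical): with unit-stride accesses the implicit vectorizer can pack consecutive work items into SIMD lanes, while gathered accesses defeat that packing even though the binary may still contain SIMD instructions.

// Unit-stride: work items i, i+1, ... touch adjacent memory,
// so the implicit vectorizer can run 4 (SSE) or 8 (AVX) of them
// per CPU thread.
__kernel void add_consecutive(__global const float* a,
                              __global const float* b,
                              __global float* c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}

// Gathered: neighbouring work items load scattered addresses,
// which blocks (or badly hurts) this packing.
__kernel void add_gathered(__global const float* a,
                           __global const int* idx,
                           __global float* c)
{
    int i = get_global_id(0);
    c[i] = a[idx[i]] + 1.0f;
}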

2) If I allocate data in local memory on a CPU, where will it be allocated? OpenCL reports the L1 cache as local memory, but it is clearly not the same type of memory as a GPU's local memory. If it's stored in RAM/global memory, then there is no sense in copying data into it; if it were kept in the cache, some other process might flush it out... so neither option makes sense.

3) How are "logical" OpenCL threads mapped to real CPU software/hardware (Intel HTT) threads? If I have short-running kernels and the threads are forked like in TBB (Threading Building Blocks) or OpenMP, the fork overhead will dominate.

4) What is the thread fork overhead? Is a new CPU thread forked for every "logical" OpenCL thread, or are the CPU threads forked once and reused for many "logical" OpenCL threads?

I hope I'm not the only one who is interested in these tiny things, and some of you might know bits of these problems. Thank you in advance!


UPDATE

3) At the moment the OpenCL overhead is more significant than OpenMP's, so heavy kernels are required for efficient runtime execution. In Intel OpenCL a work-group is mapped to a TBB thread, so one virtual CPU core executes a whole work-group (or thread block). A work-group is implemented with 3 nested for loops, where the innermost loop is vectorized, if possible. So you could imagine it as something like:

#pragma omp parallel for            /* one TBB thread per work-group */
for(wg=0; wg < get_num_groups(2)*get_num_groups(1)*get_num_groups(0); wg++) {

  for(k=0; k<get_local_size(2); k++) {
    for(j=0; j<get_local_size(1); j++) {
      #pragma simd                  /* innermost dimension: vectorized if possible */
      for(i=0; i<get_local_size(0); i++) {
        ... work-load ...
      }
    }
  }
}

If the innermost loop can be vectorized, it steps by the SIMD width instead:

for(i=0; i<get_local_size(0); i+=SIMD) {   /* SIMD = 4 (SSE) or 8 (AVX) */
  ... vectorized work-load ...
}

4) Every TBB thread is forked once during the OpenCL execution and reused afterwards. Every TBB thread is tied to a virtual core, i.e. there is no thread migration during the computation.
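To estimate this overhead yourself (a minimal sketch, assuming a command queue created with CL_QUEUE_PROFILING_ENABLE and a built kernel k): the gap between the QUEUED and START timestamps approximates the per-launch cost.

#include <stdio.h>
#include <CL/cl.h>

void measure_launch_overhead(cl_command_queue queue, cl_kernel k, size_t global)
{
    cl_event ev;
    cl_ulong queued = 0, started = 0;

    clEnqueueNDRangeKernel(queue, k, 1, NULL, &global, NULL, 0, NULL, &ev);
    clWaitForEvents(1, &ev);

    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED,
                            sizeof(queued), &queued, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(started), &started, NULL);

    /* queued -> start covers the runtime's dispatch work, in nanoseconds */
    printf("launch overhead: %lu ns\n", (unsigned long)(started - queued));
    clReleaseEvent(ev);
}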

I also accept @natchouf's answer.


I may have a few hints for your questions. In my limited experience, a good OpenCL implementation tuned for the CPU can't beat a good OpenMP implementation. If it does, you could probably improve the OpenMP code until it beats the OpenCL one.

1) It is very important to have vectorized kernels. This is linked to your questions 3 and 4. If you have a kernel that handles 4 or 8 input values per work item, you'll have far fewer work items (threads), and hence much less overhead. I recommend using the vector instructions and data types provided by OpenCL (such as float4, float8, float16) instead of relying on auto-vectorization. Do not hesitate to use float16 (or double16): it will be mapped to 4 SSE or 2 AVX vectors and will divide the number of required work items by 16 (which is good for the CPU, but not always for the GPU: I use 2 different kernels for CPU and GPU).
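For example (a minimal sketch, the kernel name is hypothetical), a saxpy-style kernel written with float8 handles 8 values per work item, so the host enqueues 8x fewer work items:

__kernel void saxpy8(float alpha,
                     __global const float8* x,
                     __global float8* y)
{
    int i = get_global_id(0);
    /* one float8 = 2 SSE or 1 AVX register; the scalar alpha is broadcast */
    y[i] = alpha * x[i] + y[i];
}

The buffer length must then be a multiple of 8 floats, and the global work size shrinks to n/8.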

2) Local memory on the CPU is just RAM. Don't use it in a CPU kernel.
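To make this concrete (a hypothetical pair of kernels): GPU-style __local staging only adds an extra copy on a CPU, where "local" memory lives in the same RAM and caches as __global:

__kernel void scale_staged(__global const float* in,
                           __global float* out,
                           __local float* tile)
{
    int l = get_local_id(0);
    int g = get_global_id(0);
    tile[l] = in[g];                     /* extra RAM-to-RAM copy on a CPU */
    barrier(CLK_LOCAL_MEM_FENCE);
    out[g] = 2.0f * tile[l];
}

__kernel void scale_direct(__global const float* in,
                           __global float* out)
{
    int g = get_global_id(0);
    out[g] = 2.0f * in[g];               /* the hardware caches do the rest */
}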

3 and 4) I don't really know; it will depend on the implementation, but the fork overhead seems significant to me.


For question 3:

Intel groups logical OpenCL threads into one hardware thread, and the group size can be 4, 8, or 16. A logical OpenCL thread maps to one SIMD lane of an execution unit; one execution unit has two SIMD engines with a width of 4. Please refer to the following document for further details: https://software.intel.com/sites/default/files/Faster-Better-Pixels-on-the-Go-and-in-the-Cloud-with-OpenCL-on-Intel-Architecture.pdf
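If you want to see what width the compiler actually chose for your kernel, you can query it from the host (a minimal sketch, assuming a built cl_kernel k and its cl_device_id dev):

#include <stdio.h>
#include <CL/cl.h>

void print_vector_width(cl_kernel k, cl_device_id dev)
{
    size_t multiple = 0;
    /* On Intel's runtimes this typically reports the SIMD width the
       compiler picked, e.g. 4 (SSE), 8 (AVX) or 16. */
    clGetKernelWorkGroupInfo(k, dev,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(multiple), &multiple, NULL);
    printf("preferred work-group size multiple: %zu\n", multiple);
}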
