CUDA Kernel register size

On a compute capability 1.3 CUDA card we run the following code:

for (int i = 1; i < 20; ++i)
    kernelrun<<<30, 320>>>(...);

We know that each SM has 8 SPs and can hold up to 1024 resident threads, and the Tesla C1060 has 30 SMs, so it can run 30*1024 threads concurrently.

Given this code, how many threads can run concurrently? If the kernelrun kernel uses 48 registers per thread, what are the limitations on the Tesla C1060, which has 16384 registers and 16 KB of shared memory per SM?

Since concurrent kernel execution is not supported on the Tesla C1060, how can we execute the kernel calls in the loop concurrently? Are streams an option, given that the Tesla C1060 has only one copy engine and one execution engine?
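For illustration, a minimal sketch of what launching each loop iteration into its own stream would look like (the real arguments of kernelrun are not shown in this post, so a no-argument placeholder kernel is used):

#include <cuda_runtime.h>

// Placeholder for the real kernelrun(...); its signature is not shown here.
__global__ void kernelrun() { }

int main()
{
    const int N = 19;                       // the loop runs for i = 1..19
    cudaStream_t streams[N];
    for (int i = 0; i < N; ++i)
        cudaStreamCreate(&streams[i]);

    // Each iteration is issued into its own stream. On a compute 1.3 part
    // the kernels still execute one after another, because concurrent
    // kernel execution is not supported; the single copy engine can,
    // however, overlap cudaMemcpyAsync transfers from pinned host memory
    // with whichever kernel is currently running.
    for (int i = 0; i < N; ++i)
        kernelrun<<<30, 320, 0, streams[i]>>>();

    cudaDeviceSynchronize();
    for (int i = 0; i < N; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}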


NVIDIA has been shipping an occupancy calculator spreadsheet since 2007, which you can use to answer this question for yourself. You should try it.

But to answer your question: each SM in your compute capability 1.3 device has 16384 registers, so if your kernel is register limited at 48 registers per thread, the maximum resident thread count is roughly 320 per SM (16384/48 = 341, rounded down to the nearest multiple of 32). There is also the register allocation granularity to consider.
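As a rough illustration of that arithmetic (a sketch only; the occupancy calculator also accounts for allocation granularity), the register usage can be queried at run time with cudaFuncGetAttributes and compared against the per-SM register file reported by cudaGetDeviceProperties. The no-argument kernelrun below is only a placeholder for the real kernel:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the real kernelrun(...); its signature is not shown in the post.
__global__ void kernelrun() { }

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);        // C1060: 16384 registers, 16 KB shared memory per SM

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, kernelrun);  // registers the compiler actually assigned per thread

    printf("register file per SM : %d\n", prop.regsPerBlock);
    printf("registers per thread : %d\n", attr.numRegs);

    // Using the 48 registers per thread quoted in the question:
    // 16384 / 48 = 341, rounded down to a warp multiple -> 320 threads per SM.
    int regsPerThread = 48;
    int regLimited = (prop.regsPerBlock / regsPerThread) / 32 * 32;
    printf("register-limited threads per SM at 48 regs: %d\n", regLimited);

    // Ignoring allocation granularity, only one 320-thread block then fits
    // per SM, so a <<<30,320>>> launch keeps 30 * 320 = 9600 threads
    // resident at a time on the C1060.
    return 0;
}

The same per-thread register count can be obtained at compile time by passing --ptxas-options=-v to nvcc and fed into the occupancy calculator spreadsheet.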
