Maximum number of blocks and threads working in parallel for a shared variable

Consider the GPU kernel function below, to be executed on a K2000 GPU card (compute capability 3.0):

#define TILE_DIM 64
__global__ void PerformSomeOperations(float* g_A, float* g_B)
{
    __shared__ float BlockData[TILE_DIM][TILE_DIM];
    // Some Operation to be performed
}

How can I determine the maximum number of blocks and threads that can execute in parallel on a single multiprocessor? Also, if I have N blocks, does that mean the shared memory available to each block is divided by N?


You can run the deviceQuery example from the CUDA samples to determine your device's limits. Inside each block you can have at most 1024 threads.

How many blocks can execute on an SM (streaming multiprocessor)? Each SM can have up to 16 active blocks on Kepler and 8 active blocks on Fermi.

You also need to think in terms of warps. One warp = 32 threads. On Fermi the maximum number of active warps per SM is 48, and on Kepler it is 64. These are theoretical maximums; the actual number of warps executing on an SM depends on your launch configuration and on the resources your kernel uses.

Occupancy is usually calculated as: occupancy = active warps / maximum active warps per SM.

If N blocks are resident on an SM at the same time, they share that SM's fixed pool of shared memory, so the more shared memory each block allocates, the fewer blocks can be resident concurrently. If you want a large number of concurrent blocks, check the CUDA Occupancy Calculator spreadsheet to see how much shared memory you can use per block without hurting performance.

But,

__shared__ float BlockData[TILE_DIM][TILE_DIM];

is allocated per block, so each block has the whole array available to itself; blocks do not split one allocation between them.
