Block dimensions in CUDA

I have a NVIDIA GTX 570 compute capability 2.0 running cuda-4.0.

The deviceQuery executable in the CUDA SDK gives me information on my CUDA device and its various properties. Two of the lines in the output are

Maximum number of threads per block: 1024

Maximum sizes of each dimension of a block: 1024 x 1024 x 64

Why is the 3rd dimension of the block restricted to be upto 64 threads only wheras the X and the Y dimension can vary upto 1024 threads?


EDIT2: ALso, please take this with a grain of salt; this is a purely hypothetical answer, or a guess. There may indeed be a clear hardware-based reason why 64 is the maximum. Frankly I don't know, and my answer is based on an assumption that there is no such hardware limit, per se.

It's probably a combination of three things: first, there is a limit to the number of threads which can be resident inside a block; second, block dimensions are typically in multiples of 32, and even more often in powers of 2 greater than 32; third, coordinate systems used in the solution of multi-dimensional problems are most often oriented so that you're looking at the scene directly (ie, with the important bits more distributed in X and Y than in Z).

CUDA naturally has to support 1D access, as this is an immensely common and efficient access pattern when it is applicable. TO support this, the X dimension must be allowed to vary over the entire range of 1024 threads.

To support 2D access, which is less common, CUDA should minimally support up to 512 in the X dimension (using the convention that the X dimension should be oriented in the coordinate system so that it measures the biggest spread) and 32 in the Y dimension. It must support up to 1024 in the X dimension, and I suppose they relax the requirement that the X dimension be no smaller than the Y dimension and allow the full 1024 range of Y values. However, in my understanding, 32 would have been plenty big for the Y dimension maximum.

To support 3D access, maintaining X, Y >= Z and trying to reach 1024, it seems to be that in the best case X=Y=Z=10; so there's no real argument for allowing Z to be greater than 10, given my assumptions

In summary, I don't see why they couldn't have made the maximums (1024, 32, 10). My question is why make them (1024, 1024, 64)? The only answer I keep coming back to is to allow some flexibility to programmers to violate the X>=Y>=Z coordinate system convention.

Edit: given my summary and hypothetical answer, the real answer to your question is this: it's an arbitary decision.


My wild guess is that because threadIdx.x, threadIdx.y and threadIdx.z are kept in a special single 32-bit register, possibly with even some other additional data. Maybe warp id? Or maybe multiprocessor-block id to identify which block given thread handles, if given multiprocessor runs more than one?

This is purely speculative, I have no data to support it, but I would imagine that they want to have as few special registers as possible.

链接地址: http://www.djcxy.com/p/38466.html

上一篇: 驱动程序找不到兼容的图形硬件

下一篇: CUDA中的块尺寸