CUDA output varying with compute capability
I am trying to run the following kernel, which is similar to a sparse matrix-vector multiplication (SpMV) kernel.
__global__ void cost_spmv(const int *population, const int *row, const int *col, int *out)
{
    /* Calculate the cost vector for multiplication of the matrices */
    //int tid = threadIdx.x + blockDim.x * blockIdx.x;
    int lane = threadIdx.x;
    extern __shared__ int val[];
    int r = blockIdx.x;
    int rowStart = row[r];
    int rowEnd   = row[r+1];
    val[threadIdx.x] = 0;
    for (int i = rowStart + lane; i < rowEnd; i += 32)
        val[threadIdx.x] += population[col[i]];
    __syncthreads();
    if (lane < 16)
        val[threadIdx.x] += val[threadIdx.x + 16];
    if (lane < 8)
        val[threadIdx.x] += val[threadIdx.x + 8];
    if (lane < 4)
        val[threadIdx.x] += val[threadIdx.x + 4];
    if (lane < 2)
        val[threadIdx.x] += val[threadIdx.x + 2];
    if (lane < 1)
        val[threadIdx.x] += val[threadIdx.x + 1];
    if (lane == 0)
        out[r] = val[threadIdx.x];
}
The kernel is invoked using
cost_spmv<<<numRows, 32, 32*sizeof(int)>>>(population, rowArray, colArray, out);
where numRows is the size of the arrays population and out (rowArray has numRows+1 elements). rowArray[i] contains the starting index of the elements belonging to row i. The size of colArray is rowArray[numRows]. colArray[i] contains the column numbers that have a non-zero value for the row described by rowArray.
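For illustration only, here is a hypothetical tiny instance of the layout described above (the numbers are made up and are not from the question; only the structure matters):

// Hypothetical 3x3 sparsity pattern (illustrative values):
//   row 0 has non-zeros in columns 0 and 2
//   row 1 has a non-zero in column 1
//   row 2 has non-zeros in columns 0, 1 and 2
int rowArray[]   = {0, 2, 3, 6};        // size numRows+1; rowArray[numRows] == 6 non-zeros
int colArray[]   = {0, 2, 1, 0, 1, 2};  // size rowArray[numRows]
int population[] = {5, 7, 1};           // size numRows; indexed by colArray entries
// out[r] is the sum of population[c] over the columns c listed for row r,
// e.g. out[0] == population[0] + population[2] == 12.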
However, on compiling it for compute capability 3.5 on a Tesla P4 I get a different answer compared to what I get for compute capability 6.1. Also, the answer I get using compute capability 6.1 on the Tesla P4 is the same as what I get using compute capability 3.5 on the 920M. What could be the reason for this?
Keep in mind that the CUDA compiler has a single-thread view of the world. It knows nothing of the run-time configuration used to execute the code, which is not available at compile time.
There are no dependencies expressed in the code between the loads of val[] and the previous writes to val[]. Therefore the compiler is free to move the loads as it sees fit. In some cases it may choose to issue some or all loads early to increase the load-latency tolerance of the code, e.g. by transforming the code as follows:
int __temp0 = val[threadIdx.x + 16];
int __temp1 = val[threadIdx.x +  8];
int __temp2 = val[threadIdx.x +  4];
int __temp3 = val[threadIdx.x +  2];
int __temp4 = val[threadIdx.x +  1];
if (lane < 16)
    val[threadIdx.x] += __temp0;
if (lane < 8)
    val[threadIdx.x] += __temp1;
if (lane < 4)
    val[threadIdx.x] += __temp2;
if (lane < 2)
    val[threadIdx.x] += __temp3;
if (lane < 1)
    val[threadIdx.x] += __temp4;
Depending on where the compiler chooses to place the loads, the results of the reduction sequence will differ. Code generation, and instruction scheduling in particular, in the CUDA compiler differs by GPU architecture, so different results may be observed when compiling for different GPU architectures.
To enforce the desired dependencies between loads and stores, the method sanctioned by the CUDA programming model is to use __syncthreads() after every reduction step to create a barrier. The potentially faster, but hacky, way to achieve the desired outcome is to declare val alterable by agents outside the scope of the code through use of the volatile modifier. This prevents the compiler from moving around the loads from val[].
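As a minimal sketch of the second (volatile) approach, the reduction in the question could be rewritten as below. The pointer vval is introduced here purely for illustration, and the sketch still relies on the same warp-synchronous execution of a 32-thread block that the original code assumes:

// Read and write val[] through a volatile pointer so the compiler
// cannot hoist the loads above the preceding stores.
volatile int *vval = val;
if (lane < 16) vval[threadIdx.x] += vval[threadIdx.x + 16];
if (lane <  8) vval[threadIdx.x] += vval[threadIdx.x +  8];
if (lane <  4) vval[threadIdx.x] += vval[threadIdx.x +  4];
if (lane <  2) vval[threadIdx.x] += vval[threadIdx.x +  2];
if (lane <  1) vval[threadIdx.x] += vval[threadIdx.x +  1];
if (lane == 0) out[r] = vval[threadIdx.x];

The __syncthreads() variant sanctioned by the programming model is the same sequence without volatile, but with a __syncthreads() barrier inserted after each if-statement so every store is visible before the next load.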