CUDA concurrent kernels serialize when using events

I have come across a serialization issue with CUDA kernels where concurrent execution is expected. I am using cudaEvents as markers to track kernel executions.

In my experiments with concurrent kernels on multiple streams, I observed that recording events on the respective streams causes kernels that should run concurrently to be serialized.

The code below demonstrates this issue. I tested it on two different devices, listed below, both of which support concurrent kernel execution:

  • Tesla C2070, Driver version 4.10, Runtime version 4.10, CUDA capability 2.0
  • Tesla M2090, Driver version 4.10, Runtime version 4.10, CUDA capability 2.0

You can run the program with and without events by toggling the USE_EVENTS macro; you will observe the difference between concurrent and serial execution.

    #include <cuda.h>
    #include <cuda_runtime.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    
    #define CUDA_SAFE_CALL(call) do {                                          \
        cudaError_t err = call;                                                 \
        if (cudaSuccess != err) {                                               \
            fprintf(stderr, "Cuda error in call at file '%s' in line %i : %s.\n", \
                    __FILE__, __LINE__, cudaGetErrorString(err));               \
            exit(-1);                                                           \
        }                                                                       \
    } while (0)
    
    
    
    // Device code: a busy-loop kernel used only to keep the GPU occupied
    __global__ void VecAdd(uint64_t len)
    {
        volatile int a = 0;
        for (uint64_t n = 0; n < len; n++)
            a++;
        return;
    }
    
    #define USE_EVENTS
    
    int
    main(int argc, char *argv[])
    {
    
        cudaStream_t stream[2];
        for(int i = 0 ; i < 2 ; i++) 
            CUDA_SAFE_CALL(cudaStreamCreate(&stream[i]));
    
    #ifdef USE_EVENTS
        cudaEvent_t e[4];
        CUDA_SAFE_CALL(cudaEventCreate(&e[0]));
        CUDA_SAFE_CALL(cudaEventCreate(&e[1]));
        CUDA_SAFE_CALL(cudaEventRecord(e[0],stream[0]));
    #endif
        VecAdd<<<1, 32, 0, stream[0]>>>(0xfffffff);
    
    #ifdef USE_EVENTS
        CUDA_SAFE_CALL(cudaEventRecord(e[1],stream[0]));
    #endif
    
    #ifdef USE_EVENTS
        CUDA_SAFE_CALL(cudaEventCreate(&e[2]));
        CUDA_SAFE_CALL(cudaEventCreate(&e[3]));
        CUDA_SAFE_CALL(cudaEventRecord(e[2],stream[1]));
    #endif
        VecAdd<<<1, 32, 0, stream[1]>>>(0xfffffff);
    
    #ifdef USE_EVENTS
        CUDA_SAFE_CALL(cudaEventRecord(e[3],stream[1]));
    #endif
        CUDA_SAFE_CALL(cudaDeviceSynchronize());
    
        for(int i = 0 ; i < 2 ; i++) 
            CUDA_SAFE_CALL(cudaStreamDestroy(stream[i]));
    
        return 0;
    
    }
    

Any suggestions on why this might be happening and how to circumvent this serialization would be useful.


The above example issues work in the following order:

    1 event record on stream A
    2 launch on stream A
    3 event record on stream A
    4 event record on stream B
    5 launch on stream B
    6 event record on stream B
    

CUDA operations on the same stream execute in issue order. CUDA operations in different streams may run concurrently.

By the programming model definition there should be concurrency. However, on current devices this work is issued to the GPU through a single push buffer. This causes the GPU to wait for operation 2 to complete before issuing operation 3, for operation 4 to complete before issuing operation 5, and so on. If the event records are removed, then the operations are

    1 launch on stream A
    2 launch on stream B
    

Operations 1 and 2 are on different streams, so it is possible for the GPU to execute the two operations concurrently.
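
For reference, here is a minimal sketch of the event-free issue order described above. It reuses VecAdd and CUDA_SAFE_CALL from the question code; with no event records queued between the two launches, nothing in the push buffer forces the second launch to wait:

    // Event-free issue order: two launches on different streams and nothing
    // in between, so the GPU is free to overlap them.
    cudaStream_t s[2];
    for (int i = 0; i < 2; i++)
        CUDA_SAFE_CALL(cudaStreamCreate(&s[i]));

    VecAdd<<<1, 32, 0, s[0]>>>(0xfffffff);   // operation 1: launch on stream A
    VecAdd<<<1, 32, 0, s[1]>>>(0xfffffff);   // operation 2: launch on stream B

    CUDA_SAFE_CALL(cudaDeviceSynchronize()); // wait for both kernels to finish

    for (int i = 0; i < 2; i++)
        CUDA_SAFE_CALL(cudaStreamDestroy(s[i]));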

Parallel Nsight and the CUDA command line profiler (v4.2) can be used to time concurrent operations. The command line profiler option is "conckerneltrace". This feature should appear in a future version of the NVIDIA Visual Profiler.


I was debugging fundamentally the same issue. Greg's answer was very helpful, though the explanation does not seem complete. The real problem is that operation 3 is waiting on operation 2 when operation 4 is issued. Even though operation 4 is in a different stream, it cannot be issued while a kernel/event is already waiting in the issue queue. This is similar to the case where more than one kernel is issued back-to-back per stream. It can be solved by delaying the end-of-stream events as follows (a sketch appears after the list):

  • event record on stream A (start timer)
  • launch on stream A
  • event record on stream B (start timer)
  • launch on stream B
  • event record on stream A (end timer)
  • event record on stream B (end timer)

Since launches are asynchronous, the end-of-stream events will wait until both the previous kernel launch in that stream is done and the kernel launches for all other streams have been issued. Clearly, this will cause the end timers to be issued too late if there are more streams than can be issued concurrently on the given hardware.
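
Here is a minimal sketch of this reordered pattern, again reusing VecAdd and CUDA_SAFE_CALL from the question code. The cudaEventElapsedTime read-back at the end is an addition for retrieving the per-stream timings and is not part of the original ordering recipe:

    // Both launches are issued before either end-of-stream event record, so
    // neither end event blocks the other stream's launch from being issued.
    cudaStream_t s[2];
    cudaEvent_t start[2], stop[2];
    for (int i = 0; i < 2; i++) {
        CUDA_SAFE_CALL(cudaStreamCreate(&s[i]));
        CUDA_SAFE_CALL(cudaEventCreate(&start[i]));
        CUDA_SAFE_CALL(cudaEventCreate(&stop[i]));
    }

    CUDA_SAFE_CALL(cudaEventRecord(start[0], s[0])); // start timer, stream A
    VecAdd<<<1, 32, 0, s[0]>>>(0xfffffff);           // launch on stream A
    CUDA_SAFE_CALL(cudaEventRecord(start[1], s[1])); // start timer, stream B
    VecAdd<<<1, 32, 0, s[1]>>>(0xfffffff);           // launch on stream B
    CUDA_SAFE_CALL(cudaEventRecord(stop[0], s[0]));  // end timer, stream A
    CUDA_SAFE_CALL(cudaEventRecord(stop[1], s[1]));  // end timer, stream B

    CUDA_SAFE_CALL(cudaDeviceSynchronize());

    for (int i = 0; i < 2; i++) {
        float ms = 0.0f;
        CUDA_SAFE_CALL(cudaEventElapsedTime(&ms, start[i], stop[i]));
        printf("stream %d: %.3f ms\n", i, ms);
        CUDA_SAFE_CALL(cudaEventDestroy(start[i]));
        CUDA_SAFE_CALL(cudaEventDestroy(stop[i]));
        CUDA_SAFE_CALL(cudaStreamDestroy(s[i]));
    }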
