Multiple processes launching CUDA kernels in parallel

I know that NVIDIA GPUs with compute capability 2.x or greater can execute up to 16 kernels concurrently. However, my application spawns 7 processes, and each of these 7 processes launches CUDA kernels.

My first question is: what is the expected behavior of these kernels? Will they execute concurrently as well, or, since they are launched by different processes, will they execute sequentially?

I am confused because the CUDA C programming guide says:

"A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context." This brings me to my second question, what are CUDA "contexts"?

Thanks!


A CUDA context is a virtual execution space that holds the code and data owned by a host thread or process. Only one context can ever be active on a GPU with all current hardware.
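To make that concrete, here is a minimal sketch of creating a context explicitly through the driver API (the runtime API creates one implicitly on first use). Error checking is omitted for brevity:

    #include <cuda.h>

    int main() {
        cuInit(0);                 // initialise the driver API

        CUdevice dev;
        cuDeviceGet(&dev, 0);      // handle to GPU 0

        CUcontext ctx;
        cuCtxCreate(&ctx, 0, dev); // create a context; it becomes current
                                   // for this host thread

        // ... module loads, memory allocations and kernel launches made by
        // this thread now live inside ctx ...

        cuCtxDestroy(ctx);         // tear down the context and everything in it
        return 0;
    }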

So to answer your first question: if you have seven separate threads or processes all trying to establish a context and run on the same GPU simultaneously, they will be serialised, and any process waiting for access to the GPU will be blocked until the owner of the running context yields. To the best of my knowledge there is no time slicing, and the scheduling heuristics are not documented and (I would suspect) not uniform from operating system to operating system.

You would be better off launching a single worker thread that holds the GPU context and using messaging from the other threads to push work onto the GPU. Alternatively, there is a context migration facility in the CUDA driver API, but it only works with threads from the same process, and the migration mechanism has latency and host-CPU overhead.
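As an illustration of that worker-thread pattern, here is a minimal sketch: one thread owns all GPU work and drains a queue that the other threads fill. The GpuWorker class and dummyKernel are illustrative names, not part of any CUDA API:

    #include <cstdio>
    #include <cuda_runtime.h>
    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    __global__ void dummyKernel(int id) {
        if (threadIdx.x == 0) printf("task %d\n", id);
    }

    class GpuWorker {
        std::queue<std::function<void()>> tasks;
        std::mutex m;
        std::condition_variable cv;
        bool done = false;
        std::thread worker;

        void run() {                // only this thread ever touches the GPU
            for (;;) {
                std::function<void()> task;
                {
                    std::unique_lock<std::mutex> lk(m);
                    cv.wait(lk, [this] { return done || !tasks.empty(); });
                    if (done && tasks.empty()) break;
                    task = std::move(tasks.front());
                    tasks.pop();
                }
                task();             // launches a kernel in the worker's context
            }
            cudaDeviceSynchronize();
        }

    public:
        GpuWorker() : worker(&GpuWorker::run, this) {}
        ~GpuWorker() {              // drains remaining work, then joins
            { std::lock_guard<std::mutex> lk(m); done = true; }
            cv.notify_one();
            worker.join();
        }
        void submit(std::function<void()> task) {
            { std::lock_guard<std::mutex> lk(m); tasks.push(std::move(task)); }
            cv.notify_one();
        }
    };

    int main() {
        GpuWorker gpu;
        std::vector<std::thread> producers;
        for (int i = 0; i < 7; ++i) // seven CPU threads pushing GPU work
            producers.emplace_back([&gpu, i] {
                gpu.submit([i] { dummyKernel<<<1, 32>>>(i); });
            });
        for (auto& t : producers) t.join();
        return 0;
    }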


Do you really need separate threads and contexts? I believe best practice is one context per GPU, because multiple contexts on a single GPU introduce significant overhead.

To execute many kernels concurrently, you should create several CUDA streams within one CUDA context and queue each kernel into its own stream; they will then execute concurrently, if there are enough resources for it.
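For example, a minimal sketch of that pattern with the runtime API (busyKernel is just a stand-in; real overlap also depends on each kernel leaving enough SM resources free for the others):

    #include <cuda_runtime.h>

    __global__ void busyKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = data[i];
            for (int k = 0; k < 1000; ++k) v = v * 1.0001f + 0.0001f;
            data[i] = v;
        }
    }

    int main() {
        const int nStreams = 7, n = 1 << 16;
        cudaStream_t streams[nStreams];
        float *buf[nStreams];

        for (int s = 0; s < nStreams; ++s) {
            cudaStreamCreate(&streams[s]);
            cudaMalloc(&buf[s], n * sizeof(float));
        }

        // Kernels in different streams may overlap on compute capability
        // >= 2.0 hardware, subject to resource availability.
        for (int s = 0; s < nStreams; ++s)
            busyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);

        cudaDeviceSynchronize();

        for (int s = 0; s < nStreams; ++s) {
            cudaStreamDestroy(streams[s]);
            cudaFree(buf[s]);
        }
        return 0;
    }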

If you need to make the context accessible from several CPU threads, you can use cuCtxPopCurrent() and cuCtxPushCurrent() to pass it around, but only one thread will be able to work with the context at any given time.
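A minimal sketch of that hand-off follows; the mutex here stands in for whatever synchronisation your application uses to enforce the one-thread-at-a-time rule:

    #include <cuda.h>
    #include <mutex>
    #include <thread>

    CUcontext g_ctx;
    std::mutex g_ctxLock;

    void useSharedContext() {
        std::lock_guard<std::mutex> lk(g_ctxLock); // one thread at a time
        cuCtxPushCurrent(g_ctx);   // bind the shared context to this thread
        // ... driver API work: cuMemAlloc, cuLaunchKernel, etc. ...
        cuCtxPopCurrent(NULL);     // release it for the next thread
    }

    int main() {
        cuInit(0);
        CUdevice dev;
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&g_ctx, 0, dev); // created current on this thread...
        cuCtxPopCurrent(NULL);       // ...so pop it before sharing

        std::thread a(useSharedContext), b(useSharedContext);
        a.join(); b.join();

        cuCtxDestroy(g_ctx);
        return 0;
    }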
