How to determine why a CUDA stream is blocking

I am trying to move an algorithm I had written from a Tesla T10 processor (compute capability 1.3) to a Tesla M2075 (compute capability 2.0). After switching, I was surprised to find that my algorithm slowed down. I analyzed it and found that it seems to be because the CUDA streams are blocking on the new machine. My algorithm has 3 main tasks that can be split up and run in parallel: memory reorganization (which can be done on the CPU), memory copying from the host to the device, and the kernel execution on the device. On the old machine splitting the work across streams allowed the 3 tasks to overlap like this (all screenshots from the NVIDIA Visual Profiler): [screenshot: correct stream execution]

However, on the new machine the streams block before starting the CPU computation until the previous kernel has finished executing, as can be seen here: [screenshot: execution with 3 streams]

In the top row, all the orange blocks are cudaStreamSynchronize calls, which block until the previous kernel has finished executing, even though that kernel is on a completely different stream. It seems to work for the first pass through the streams and parallelizes correctly, but after that the problem starts. I thought that maybe it was blocking on something, so I tried increasing the number of streams, which gave me this result: [screenshot: execution with 12 streams]

Here you can see that, for some reason, only the first 4 streams block; after that it starts parallelizing properly. As a last attempt I tried to hack around it by using the first 4 streams only once and then switching to the later streams, but that still didn't work: it still stalled every 4 streams while letting the other streams execute concurrently: [screenshot: execution with 10 streams]

So I am looking for any ideas as to what could be causing this problem and how to diagnose it. I have pored over my code and I don't think the bug is there, although I could be mistaken. Each stream is encapsulated in its own class, and that class holds only a single cudaStream_t as a member, so I don't see how a stream could be referencing another stream and blocking on it.
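To make the structure concrete, here is a simplified sketch of the pattern I am describing; the kernel, helper, and buffer names below are placeholders rather than my actual code, and each stream gets its own pinned host buffer and device buffers:

```cuda
#include <cuda_runtime.h>

__global__ void processKernel(const float* d_in, float* d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_out[i] = 2.0f * d_in[i];   // placeholder computation
}

void reorganizeOnCpu(float* h_buf, int n)
{
    // placeholder for the CPU-side memory reorganization
    (void)h_buf; (void)n;
}

// One pinned host buffer and one pair of device buffers per stream,
// so iterations on different streams never touch the same memory.
void runPipeline(float** h_pinned, float** d_in, float** d_out, int n,
                 cudaStream_t* streams, int numStreams, int iterations)
{
    for (int i = 0; i < iterations; ++i) {
        int s = i % numStreams;

        // Wait only for this stream's previous iteration so the CPU can
        // safely reuse its pinned host buffer; other streams keep running.
        cudaStreamSynchronize(streams[s]);

        reorganizeOnCpu(h_pinned[s], n);                          // CPU task
        cudaMemcpyAsync(d_in[s], h_pinned[s], n * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);      // HtoD copy
        processKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(   // kernel
            d_in[s], d_out[s], n);
    }
}
```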

Are there changes to the way streams work between compute capability 1.3 and 2.0 that I'm not aware of? Could it be something like shared memory not being freed, forcing a wait on that? Any ideas on how to diagnose this problem are welcome, thanks.


I cannot be completely sure without seeing code, but it looks like you may be having an issue with the order in which you enqueue your commands. There is a slight difference in how compute capability 1.x and 2.x devices handle streams, because 2.x devices can run multiple kernels concurrently and handle HtoD and DtoH copies simultaneously.

If you enqueue your commands in the order all HtoD copies, then all kernels, then all DtoH copies, you will get good results on Tesla-generation cards (C1060 et al.).

If you order them copy HtoD, compute, copy DtoH, copy HtoD, and so on, you will get good results on Fermi.
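For concreteness, here is a minimal sketch of the two issue orders. The kernel, buffer names, and launch configuration are placeholders, and the host buffers are assumed to be pinned (allocated with cudaMallocHost) so the async copies can actually overlap:

```cuda
#include <cuda_runtime.h>

__global__ void kernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;   // placeholder work
}

// Breadth-first issue order: batch each operation type across all streams.
// This is the order that overlaps well on compute capability 1.x (Tesla).
void issueBreadthFirst(float** h_in, float** h_out, float** d_in, float** d_out,
                       cudaStream_t* stream, int nStreams, int n)
{
    size_t bytes = n * sizeof(float);
    for (int i = 0; i < nStreams; ++i)
        cudaMemcpyAsync(d_in[i], h_in[i], bytes, cudaMemcpyHostToDevice, stream[i]);
    for (int i = 0; i < nStreams; ++i)
        kernel<<<(n + 255) / 256, 256, 0, stream[i]>>>(d_in[i], d_out[i], n);
    for (int i = 0; i < nStreams; ++i)
        cudaMemcpyAsync(h_out[i], d_out[i], bytes, cudaMemcpyDeviceToHost, stream[i]);
}

// Depth-first issue order: the whole HtoD -> kernel -> DtoH chain per stream.
// This is the order that overlaps well on compute capability 2.x (Fermi).
void issueDepthFirst(float** h_in, float** h_out, float** d_in, float** d_out,
                     cudaStream_t* stream, int nStreams, int n)
{
    size_t bytes = n * sizeof(float);
    for (int i = 0; i < nStreams; ++i) {
        cudaMemcpyAsync(d_in[i], h_in[i], bytes, cudaMemcpyHostToDevice, stream[i]);
        kernel<<<(n + 255) / 256, 256, 0, stream[i]>>>(d_in[i], d_out[i], n);
        cudaMemcpyAsync(h_out[i], d_out[i], bytes, cudaMemcpyDeviceToHost, stream[i]);
    }
}
```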

Kepler does equally well in both cases. The issue order matters across streams on both Tesla and Fermi, so I suggest reading this NVIDIA post for more information. Overlapping across streams can be an extremely complicated problem; I wish you well. If you want further help, a general representation of the order in which you enqueue operations would be extremely helpful.
