GPU programming strategies using CUDA

I need some advice on a project that I am about to undertake. I am planning to run simple kernels (yet to be decided, but I am leaning toward embarrassingly parallel ones) on a multi-GPU node using CUDA 4.0, following the strategies listed below. The intention is to profile the node by launching kernels under the different strategies that CUDA provides in a multi-GPU environment.

  • Single host thread - multiple devices (shared context)
  • Single host thread - concurrent execution of kernels on a single device (shared context); see the sketch below for what I mean by this one
  • Multiple host threads - (Equal) Multiple devices (independent contexts)
  • Single host thread - Sequential kernel execution on one device
  • Multiple host threads - concurrent execution of kernels on one device (independent contexts)
  • Multiple host threads - sequential execution of kernels on one device (independent contexts)
Am I missing any categories? What is your opinion of the test categories I have chosen? Any general advice with regard to multi-GPU programming is also welcome.
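For concreteness, this is the kind of thing I have in mind for the second category: one host thread launching several kernels into separate streams on one device. The `scale` kernel and the sizes are just placeholders, since the real kernels are still to be decided.

```
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel; the real workload is yet to be decided.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int nStreams = 4;
    const int n = 1 << 20;

    cudaStream_t streams[nStreams];
    float *d_buf[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&d_buf[s], n * sizeof(float));
        cudaMemset(d_buf[s], 0, n * sizeof(float));
    }

    // One launch per stream; whether they actually overlap depends on
    // the device (concurrent kernels need compute capability >= 2.0).
    for (int s = 0; s < nStreams; ++s)
        scale<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_buf[s], n, 2.0f);

    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) {
        cudaFree(d_buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    printf("done\n");
    return 0;
}
```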

Thanks,
Sayan

EDIT:

I thought the previous categorization involved some redundancy, so I modified it.


Most workloads are light enough on CPU work that you can juggle multiple GPUs from a single host thread, but that only became easily possible starting with CUDA 4.0. Before CUDA 4.0, you had to call cuCtxPopCurrent()/cuCtxPushCurrent() to change which context was current to the calling thread. Starting with CUDA 4.0, you can simply call cudaSetDevice() to make the context corresponding to a given device current.
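To make that concrete, here is a rough, untested sketch of the CUDA 4.0 single-thread pattern; the `scale` kernel, the buffer size, and the cap of eight devices are placeholders, not part of your problem:

```
#include <cuda_runtime.h>

// Placeholder kernel used only to illustrate the launch pattern.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int maxDevices = 8;        // arbitrary cap for the sketch
    const int n = 1 << 20;

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount > maxDevices) deviceCount = maxDevices;

    float *d_buf[maxDevices];
    cudaStream_t streams[maxDevices];

    // cudaSetDevice() makes the given device's context current, so each
    // allocation, stream, and launch below belongs to that device.
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        cudaMalloc(&d_buf[dev], n * sizeof(float));
        cudaStreamCreate(&streams[dev]);
        scale<<<(n + 255) / 256, 256, 0, streams[dev]>>>(d_buf[dev], n, 2.0f);
    }

    // Second pass from the same host thread to synchronize and clean up.
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
        cudaFree(d_buf[dev]);
        cudaStreamDestroy(streams[dev]);
    }
    return 0;
}
```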

The "shared context" in your option 1) is a misnomer, though, because there is no shared context: the GPU contexts are still separate, and device memory and objects such as CUDA streams and CUDA events are affiliated with the GPU context in which they were created.
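For example (a sketch assuming a machine with at least two GPUs; the exact error code may vary), a stream created while device 0 is current cannot be used to launch a kernel while device 1 is current:

```
#include <cuda_runtime.h>
#include <cstdio>

__global__ void noop() {}

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount < 2) { printf("needs at least two GPUs\n"); return 0; }

    cudaStream_t s0;
    cudaSetDevice(0);
    cudaStreamCreate(&s0);      // s0 belongs to device 0's context

    cudaSetDevice(1);           // device 1 is now current
    noop<<<1, 1, 0, s0>>>();    // stream from another device: expected to fail
    printf("cross-device launch: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaSetDevice(0);
    noop<<<1, 1, 0, s0>>>();    // same stream with its own device: fine
    cudaDeviceSynchronize();

    cudaStreamDestroy(s0);
    return 0;
}
```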


"Multiple host threads - (equal) multiple devices, independent contexts" is the winner if you can get away with it. This assumes that you can get truly independent units of work, which should be true since your problem is embarrassingly parallel.
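A rough sketch of that arrangement, using OpenMP to get one host thread per device. The kernel, sizes, and the choice of OpenMP are my own placeholders; note too that with the runtime API the threads of a single process share each device's context, so strictly independent contexts would mean the driver API or one process per GPU.

```
// Compile with something like: nvcc -Xcompiler -fopenmp multi_gpu.cu
#include <cuda_runtime.h>
#include <omp.h>
#include <cstdio>

// Placeholder kernel standing in for an independent unit of work.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount < 1) return 0;

    // One OpenMP thread per GPU; each thread binds to its own device and
    // works on its own buffer, so the units of work stay independent.
    #pragma omp parallel num_threads(deviceCount)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);

        float *d_buf = NULL;
        cudaMalloc(&d_buf, n * sizeof(float));
        cudaMemset(d_buf, 0, n * sizeof(float));

        scale<<<(n + 255) / 256, 256>>>(d_buf, n, 2.0f);
        cudaDeviceSynchronize();

        cudaFree(d_buf);
    }
    printf("all devices done\n");
    return 0;
}
```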

Caveat emptor: I have not personally built a large-scale multi-GPU system, but I have built a successful single-GPU system with three orders of magnitude of acceleration relative to CPUs. So this advice is a generalization of the synchronization costs I have seen, plus discussions with colleagues who have built multi-GPU systems.
