Is cuda suitable for array filtering

I am working on optimizing a program with a time complexity of O(N^7). I have an array of strings, represented as 32 bit integers, where each bit corresponds to a specific character in the input string. The job is to find all combinations of input strings, where each character is present exactly once, and having all characters present. The naive solutions require 7 layers of recursion, with each layer iterating over the entire list. This pretty soon becomes extremely slow.

So I was wondering if I could use cuda to speed up the process a bit, by feeding the GPU with an array of possible strings, and a bitmask that should not be matched, and get a filtered list back, so I could speed up the recursive steps a bit.

So the question is: Is this kind of filtering suitable for parallel processing?

What I am doing in C right now is described below.

void recursive_search (unsigned int used, unsigned int *list, int listlen,
                       int start,unsigned int * stack, int reclevel) {
  int index, newindex;
  newindex = 0;

  for (index=0; index< listlen; index++) {
    if (!list[index] & used) {
      newlist[newindex++] = list[index];
    }
  }      

  if ((newindex == 1 && (used | newlist[0])) == 0xffffffff) {
    /* Hooray! We have a match */
    stack[reclevel] = newlist[0];
    report_match(stack);
    return;
  }

  for (index = 0;index < newindex; index++) {
    recursive_search (used | newlist[index], newlist, newindex,
                      index, stack, reclevel + 1);
  }
}

I hope this make my question clearer.


  • Do not try to convert the full algorithm into a single kernel. Try to parallelize each step one by one.
  • This code section below may be converted to a copy_if statement.

    for (index=0; index< listlen; index++) {
        if (!list[index] & used) {
            newlist[newindex++] = list[index];
        }
    }
    

    And the statement is like

    thrust::copy_if(list.begin(), list.end(), newlist.begin(), predicate());
    

    So you will achieve your new list easily.

    You can generate the array of possible strings on the GPU?

    About the recursion:

  • Because you will have many, many threads active in the GPU, there might be more than one result/match possible. Returning first match, or all matches may be a problem.
  • If recursion is a must, you may call the kernel several times by storing the partial progress in the global memory at each time.
  • As @harrism indicated, you may convert the recursive part to a simpler flow and let the parallelism handle each cases.
  • 链接地址: http://www.djcxy.com/p/25392.html

    上一篇: Cuda图像平均过滤器

    下一篇: cuda适合阵列过滤