Many set intersections on GPU

I'm investigating how suitable could a GPU (CUDA) be for my massive computation task.

I have 2.5 million 1024 element integer sorted vectors. I need to compare all of them. The comparison is a standard set_intersection as performed by the std::set_intersection of the C++ standard library and the comparison is symmetric.

This means the the number of comparisons I need to do is (2.5 millions x 2.5 millions - 2.5 millions)/2 = ~3.125x10¹² comparisons.

All the standard optimization libraries available for CUDA only work on two pairs of vectors of different sizes, which is not my situation as my 1024 vector size can't be changed but I have a huge amount of pairs to compare.

So far all my tests show that on a GTX 1060 GPU I can achieve only a 10x speedup compared to a single core of a Intel(R) Xeon(R) CPU E5-1620 0 @ 3.60GHz using SSE instructions to speed up the set_intersection algorithm.

This seems to be mainly due to the fact that my dataset is huge and every single computation task needs 8KB (2x 1024 int vectors) to calculate the result.

I don't seem to be able to use the CUDA shared memory in any useful way as my computation needs a lot of data to be available to each thread and quickly exceeds the CUDA shared memory of any available device.

I would appreciate some ideas on how to approach this kind of problems on modern GPUs especially related to huge datasets.

Is the ratio of x10 compared to a single CPU core something usually observed for this kind of problem?

#include <stdio.h>
#include <iostream>
#include <chrono>

typedef struct hash_type {
int32_t data[1024];
} hash_type;

__global__
void test_kernel_without_shared_memory(int job_size, hash_type* dev_full_hash_list_ptr, uint64_t nb_of_hashes, uint8_t* result_ptr)
{
    int job_id = blockIdx.x*blockDim.x + threadIdx.x;

    if (job_id < job_size) {
        uint32_t i = (uint32_t)(nb_of_hashes+0.5-sqrtf((nb_of_hashes-0.5)*(nb_of_hashes-0.5) - 2*job_id));
        uint32_t j = i + job_id - ((i-1)*(nb_of_hashes-(double)i/2));
        i--;

        size_t a = 0, b = 0;
        size_t counter = 0;

        while(a < 1024 && b < 1024) {
            if((dev_full_hash_list_ptr+i)->data[a] < (dev_full_hash_list_ptr+j)->data[b]) {
                a++;
            } else if((dev_full_hash_list_ptr+i)->data[a] > (dev_full_hash_list_ptr+j)->data[b]) {
                b++;
            } else {
                counter++;
                a++; b++;
            }
        }
        result_ptr[job_id] = (uint8_t)(0.5 + 100*counter / (double) (2 * 1024 - counter));
    }
}

#define threads_per_block 1024
#define max_grid_dimension 255*255*255*255
uint8_t* GPU_compute_distances(hash_type* full_hash_list, uint64_t nb_hashes) {
    hash_type* dev_full_hash_list;
    cudaMalloc(&dev_full_hash_list, nb_hashes*sizeof(hash_type)); 
    cudaMemcpy(dev_full_hash_list, full_hash_list, nb_hashes*sizeof(hash_type), cudaMemcpyHostToDevice);

    uint64_t total_job_size = (nb_hashes * nb_hashes - nb_hashes)/2;
    uint8_t* dev_result_ptr;
    cudaMalloc(&dev_result_ptr, total_job_size*sizeof(uint8_t)); 

    printf("total_job_size=%lun", total_job_size);
    printf("nb_hashes=%lun", nb_hashes);

    test_kernel_without_shared_memory<<<total_job_size/threads_per_block+1, threads_per_block>>>(total_job_size, dev_full_hash_list, nb_hashes, dev_result_ptr);

    uint8_t* result_ptr;
    result_ptr = (uint8_t*)malloc(total_job_size*sizeof(uint8_t));

    cudaMemcpy(result_ptr, dev_result_ptr, total_job_size*sizeof(uint8_t), cudaMemcpyDeviceToHost);

    cudaFree(dev_result_ptr); 
    cudaFree(dev_full_hash_list); 

    return result_ptr;
}

#define test_size 10000
int main() {
    hash_type* full_hash_list;
    full_hash_list = (hash_type*)malloc(test_size*sizeof(hash_type));
    for (int i = 0; i< test_size; i++) {
        for (int j = 0; j < 100; j++) {
            full_hash_list[i].data[j] = -1;
        }
        for (int j = 100; j < 1024; j++) {
            full_hash_list[i].data[j] = i*1024+j;
        }
    }

    auto started = std::chrono::high_resolution_clock::now();

    uint8_t* result = GPU_compute_distances(full_hash_list, test_size);

    auto done = std::chrono::high_resolution_clock::now();
    std::cout << "Task duration (msec):" << std::chrono::duration_cast<std::chrono::milliseconds>(done-started).count() << "n";

    for (int i = 0; i< 10; i++) {
        std::cout << (uint32_t)result[i] << "n";
    }

    free(full_hash_list);
    free(result);

    return 0;
}

Where

uint32_t i = (uint32_t)(nb_of_hashes+0.5-sqrtf((nb_of_hashes-0.5)*(nb_of_hashes-0.5) - 2*job_id));
uint32_t j = i + job_id - ((i-1)*(nb_of_hashes-(double)i/2));
i--;

Is a formula that allows to translate a comparison job_id located in the upper half matrix of data to the pointer offsets. For example in a sample list of 5 vectors:

x 0 1 2 3
x x 4 5 6
x x x 7 8
x x x x 9
x x x x x

0 -> [0][1], 1 -> [0][2], etc. And is not really important to my performance issue.

My first attempt was with (10000x10000-10000)/2 calls to thrust::set_intersection passing the hashes by value.

nvprof gave me a lot of cudaMalloc/cudaFree calls and the execution was pretty slow (magnitude of times slower compared to the pure CUDA code in the example).

My second attempt was with one call to thrust::for_each_n passing the pointers to the pairs of vectors to compare. This implementation was pretty close to the one in the sample but it did not give me the possibility to experiment with different kernel parameters.

The last and fastest is the one presented in the sample code.

I also tried to reduce the number of threads per block to 4 and copy the 2 vectors to local vector and vector array[4] with the shared decorators, followed by a __syncthreads() call. The fact that I could only start 4 threads per block was a performance killer without surprise.

So the current implementation with the best performance is the one that leaves the management of cache completely to the GPU and its L2 cache.

I also tried to compile with L1+L2 cache and lost about 25% performance compared to L2 cache only.

I investigated also the possibility of CUDA texture memory but did not push it into a working sample to test as from the online documentation I read it seemed to me not really well suited because I can't find a suitable addressing pattern. But I might perfectly be wrong here.

链接地址: http://www.djcxy.com/p/25402.html

上一篇: 使用Spring和AspectJ进行可配置的组件

下一篇: 许多GPU上设置交叉点