How to improve on this implementation of the radix

2018-06-25 06:51:09

I'm implementing a 2-byte Radix Sort. The concept is to use Counting Sort, to sort the lower 16 bits of the integers, then the upper 16 bits. This allows me to run the sort in 2 iterations. The first concept I had was trying to figure out how to handle negatives. Since the sign bit would be flipped for negative numbers, then in hex form, that would make negatives greater than the positives. To combat this I flipped the sign bit when it was positive, in order to make [0, 2 bil) = [128 000 000 000, 255 255...). And when it was negative I flipped all the bits, to make it range from (000 000 .., 127 255 ..). This site helped me with that information. To finish it off, I would split the integer into either the top or bottom 16-bits based on the pass. The following is the code allowing me to do that.

static uint32_t position(int number, int pass) {
    int mask;
    if (number <= 0) mask = 0x80000000;
    else mask = (number >> 31) | 0x80000000;
    uint32_t out = number ^ mask;
    return pass == 0 ? out & 0xffff : (out >> 16) & 0xffff;
}

To start the actual Radix Sort, I needed to form a histogram of size 65536 elements. The problem I ran across was when the number of elements inputted was very large. It would take a while to create the histogram, so I implemented it in parallel, using processes and shared memory. I partitioned the array into subsections of size / 8. Then over an array of shared memory sized 65536 * 8, I had each process create its own histogram. Afterwards, I summed it all together to form a single histogram. The following is the code for that:

for (i=0;i<8;i++) {
    pid_t pid = fork();
    if (pid < 0) _exit(0);
    if (pid == 0) {
        const int start = (i * size) >> 3;
        const int stop  = i == 7 ? size : ((i + 1) * size) >> 3;
        const int curr  = i << 16;
        for (j=start;j<stop;++j)
            hist[curr + position(array[j], pass)]++;
        _exit(0);
    }
}
for (i=0;i<8;i++) wait(NULL);

for (i=1;i<8;i++) {
    const int pos = i << 16;
    for (j=0;j<65536;j++)
        hist[j] += hist[pos + j];
}

The next part was where I spent most of my time analyzing how cache affected the performance of the prefix-sum. With an 8-bit and 11-bit pass Radix Sort, all of the histogram would fit within L1 cache. With 16-bits, it would only fit within L2 cache. In the end the 16-bit histogram ran the sum the fastest, since I only had to run 2 iterations with it. I also ran the prefix sum in parallel as per the CUDA website recommendations. At 250 million elements, this ran about 1.5 seconds slower than the 16-bit integer. So my prefix sum ended up looking like this:

for (i=1;i<65536;i++)
    hist[i] += hist[i-1];

The only thing left was to traverse backwards through the array and put all the elements into their respective spots in the temp array. Since I only had to go through twice, instead of copying from the temp back to array, and running the code again. I ran the sort first using array as the input, and temp as the output. Then ran it the second time using temp as the input and array as the output. This kept me from mem-copying back to array both times. The code looks like this for the actual sort:

histogram(array, size, 0, hist);
for (i=size-1;i>=0;i--)
    temp[--hist[position(array[i], 0)]] = array[i];

memset(hist, 0, arrSize);
histogram(temp, size, 1, hist);
for (i=size-1;i>=0;i--)
    array[--hist[position(temp[i], 1)]] = temp[i];

This link contains the full code that I have so far. I ran a test against quicksort, and it ran between 5 and 10 times faster with integers and floats, and about 5 times faster with 8-byte data types. Is there a way to improve on this?

My guess would be that treating the sign of the integers during operation is not worth it. It complexyfies and slows down your code. I'd go for a first sort as unsigned and then do a second path that just reorders the two halves and inverts the one of the negatives.

Also from your code I don't get how you have different processes operate together. How do you collect the histogram in the parent? you have a process shared variable? In any case using ptrhead would be much more appropriate, here.

链接地址: http://www.djcxy.com/p/70792.html

上一篇: 编写一个程序，从10亿个数字中找出100个最大的数字

下一篇: 如何改进这个基数的实现