Place Radix Sort

This is a long text. Please bear with me. Boiled down, the question is: Is there a workable in-place radix sort algorithm ?


Preliminary

I've got a huge number of small fixed-length strings that only use the letters “A”, “C”, “G” and “T” (yes, you've guessed it: DNA) that I want to sort.

At the moment, I use std::sort which uses introsort in all common implementations of the STL. This works quite well. However, I'm convinced that radix sort fits my problem set perfectly and should work much better in practice.

Details

I've tested this assumption with a very naive implementation and for relatively small inputs (on the order of 10,000) this was true (well, at least more than twice as fast). However, runtime degrades abysmally when the problem size becomes larger (N > 5,000,000).

The reason is obvious: radix sort requires copying the whole data (more than once in my naive implementation, actually). This means that I've put ~ 4 GiB into my main memory which obviously kills performance. Even if it didn't, I can't afford to use this much memory since the problem sizes actually become even larger.

Use Cases

Ideally, this algorithm should work with any string length between 2 and 100, for DNA as well as DNA5 (which allows an additional wildcard character “N”), or even DNA with IUPAC ambiguity codes (resulting in 16 distinct values). However, I realize that all these cases cannot be covered, so I'm happy with any speed improvement I get. The code can decide dynamically which algorithm to dispatch to.

Research

Unfortunately, the Wikipedia article on radix sort is useless. The section about an in-place variant is complete rubbish. The NIST-DADS section on radix sort is next to nonexistent. There's a promising-sounding paper called Efficient Adaptive In-Place Radix Sorting which describes the algorithm “MSL”. Unfortunately, this paper, too, is disappointing.

In particular, there are the following things.

First, the algorithm contains several mistakes and leaves a lot unexplained. In particular, it doesn't detail the recursion call (I simply assume that it increments or reduces some pointer to calculate the current shift and mask values). Also, it uses the functions dest_group and dest_address without giving definitions. I fail to see how to implement these efficiently (that is, in O(1); at least dest_address isn't trivial).

Last but not least, the algorithm achieves in-place-ness by swapping array indices with elements inside the input array. This obviously only works on numerical arrays. I need to use it on strings. Of course, I could just screw strong typing and go ahead assuming that the memory will tolerate my storing an index where it doesn't belong. But this only works as long as I can squeeze my strings into 32 bits of memory (assuming 32 bit integers). That's only 16 characters (let's ignore for the moment that 16 > log(5,000,000)).

Another paper by one of the authors gives no accurate description at all, but it gives MSL's runtime as sub-linear which is flat out wrong.

To recap : Is there any hope of finding a working reference implementation or at least a good pseudocode/description of a working in-place radix sort that works on DNA strings?


Well, here's a simple implementation of an MSD radix sort for DNA. It's written in D because that's the language that I use most and therefore am least likely to make silly mistakes in, but it could easily be translated to some other language. It's in-place but requires 2 * seq.length passes through the array.

void radixSort(string[] seqs, size_t base = 0) {
    if(seqs.length == 0)
        return;

    size_t TPos = seqs.length, APos = 0;
    size_t i = 0;
    while(i < TPos) {
        if(seqs[i][base] == 'A') {
             swap(seqs[i], seqs[APos++]);
             i++;
        }
        else if(seqs[i][base] == 'T') {
            swap(seqs[i], seqs[--TPos]);
        } else i++;
    }

    i = APos;
    size_t CPos = APos;
    while(i < TPos) {
        if(seqs[i][base] == 'C') {
            swap(seqs[i], seqs[CPos++]);
        }
        i++;
    }
    if(base < seqs[0].length - 1) {
        radixSort(seqs[0..APos], base + 1);
        radixSort(seqs[APos..CPos], base + 1);
        radixSort(seqs[CPos..TPos], base + 1);
        radixSort(seqs[TPos..seqs.length], base + 1);
   }
}

Obviously, this is kind of specific to DNA, as opposed to being general, but it should be fast.

Edit:

I got curious whether this code actually works, so I tested/debugged it while waiting for my own bioinformatics code to run. The version above now is actually tested and works. For 10 million sequences of 5 bases each, it's about 3x faster than an optimized introsort.


I've never seen an in-place radix sort, and from the nature of the radix-sort I doubt that it is much faster than a out of place sort as long as the temporary array fits into memory.

Reason:

The sorting does a linear read on the input array, but all writes will be nearly random. From a certain N upwards this boils down to a cache miss per write. This cache miss is what slows down your algorithm. If it's in place or not will not change this effect.

I know that this will not answer your question directly, but if sorting is a bottleneck you may want to have a look at near sorting algorithms as a preprocessing step (the wiki-page on the soft-heap may get you started).

That could give a very nice cache locality boost. A text-book out-of-place radix sort will then perform better. The writes will still be nearly random but at least they will cluster around the same chunks of memory and as such increase the cache hit ratio.

I have no idea if it works out in practice though.

Btw: If you're dealing with DNA strings only: You can compress a char into two bits and pack your data quite a lot. This will cut down the memory requirement by factor four over a naiive representation. Addressing becomes more complex, but the ALU of your CPU has lots of time to spend during all the cache-misses anyway.


Based on dsimcha's code, I've implemented a more generic version that fits well into the framework we use (SeqAn). Actually, porting the code was completely straightforward. Only afterwards did I find that there are actually publications concerning this very topic. The great thing is: they basically say the same as you guys. A paper by Andersson and Nilsson on Implementing Radixsort is definitely worth the read. If you happen to know German, be sure to also read David Weese's diploma thesis where he implements a generic substring index. Most of the thesis is devoted to a detailed analysis of the cost of building the index, considering secondary memory and extremely large files. The results of his work have actually been implemented in SeqAn, only not in those parts where i needed it.

Just for fun, here's the code I've written (I don't think anyone not using SeqAn will have any use for it). Notice that it still doesn't consider radixes greater 4. I expect that this would have a huge impact on performance but unfortunately I simply don't have the time right now to implement this.

The code performs more than twice as fast as Introsort for short strings. The break-even point is at a length of about 12–13. The type of string (eg whether it has 4, 5, or 16 different values) is comparatively unimportant. Sorting > 6,000,000 DNA reads from chromosome 2 of the human genome takes just over 2 seconds on my PC. Just for the record, that's fast! Especially considering that I don't use SIMD or any other hardware acceleration. Furthermore, valgrind shows me that the main bottleneck is operator new in the string assignments. It gets called about 65,000,000 times – ten times for each string! This is a dead giveaway that swap could be optimized for these strings: instead of making copies, it could just swap all characters. I didn't try this but I'm convinced that it would make a hell of a difference. And, just to say it again, in case someone wasn't listening: the radix size has nearly no influence on runtime – which means that I should definitely try to implement the suggestion made by FryGuy, Stephan and EvilTeach.

Ah yes, by the way: cache locality is a noticeable factor : Starting at 1M strings, the runtime no longer increases linearly. However, this could be fixed quite easily: I use insertion sort for small subsets (<= 20 strings) – instead of mergesort as suggested by the random hacker. Apparently, this performs even better than mergesort for such small lists (see the first paper I linked).

namespace seqan {

template <typename It, typename F, typename T>
inline void prescan(It front, It back, F op, T const& id) {
    using namespace std;
    if (front == back) return;
    typename iterator_traits<It>::value_type accu = *front;
    *front++ = id;
    for (; front != back; ++front) {
        swap(*front, accu);
        accu = op(accu, *front);
    }
}

template <typename TIter, typename TSize, unsigned int RADIX>
inline void radix_permute(TIter front, TIter back, TSize (& bounds)[RADIX], TSize base) {
    for (TIter i = front; i != back; ++i)
        ++bounds[static_cast<unsigned int>((*i)[base])];

    TSize fronts[RADIX];

    std::copy(bounds, bounds + RADIX, fronts);
    prescan(fronts, fronts + RADIX, std::plus<TSize>(), 0);
    std::transform(bounds, bounds + RADIX, fronts, bounds, plus<TSize>());

    TSize active_base = 0;

    for (TIter i = front; i != back; ) {
        if (active_base == RADIX - 1)
            return;
        while (fronts[active_base] >= bounds[active_base])
            if (++active_base == RADIX - 1)
                return;
        TSize current_base = static_cast<unsigned int>((*i)[base]);
        if (current_base <= active_base)
            ++i;
        else
            std::iter_swap(i, front + fronts[current_base]);
        ++fronts[current_base];
    }
}

template <typename TIter, typename TSize>
inline void insertion_sort(TIter front, TIter back, TSize base) {
    typedef typename Value<TIter>::Type T;
    struct {
        TSize base, len;
        bool operator ()(T const& a, T const& b) {
            for (TSize i = base; i < len; ++i)
                if (a[i] < b[i]) return true;
                else if (a[i] > b[i]) return false;
            return false;
        }
    } cmp = { base, length(*front) }; // No closures yet. :-(

    for (TIter i = front + 1; i != back; ++i) {
        T value = *i;
        TIter j = i;
        for ( ; j != front && cmp(value, *(j - 1)); --j)
            *j = *(j - 1);
        if (j != i)
            *j = value;
    }
}

template <typename TIter, typename TSize, unsigned int RADIX>
inline void radix(TIter top, TIter front, TIter back, TSize base, TSize (& parent_bounds)[RADIX], TSize next) {
    if (back - front > 20) {
        TSize bounds[RADIX] = { 0 };
        radix_permute(front, back, bounds, base);

        // Sort current bucket recursively by suffix.
        if (base < length(*front) - 1)
            radix(front, front, front + bounds[0], base + 1, bounds, static_cast<TSize>(0));
    }
    else if (back - front > 1)
        insertion_sort(front, back, base);

    // Sort next buckets on same level recursively.
    if (next == RADIX - 1) return;
    radix(top, top + parent_bounds[next], top + parent_bounds[next + 1], base, parent_bounds, next + 1);
}

template <typename TIter>
inline void radix_sort(TIter front, TIter back) {
    typedef typename Container<TIter>::Type TStringSet;
    typedef typename Value<TStringSet>::Type TString;
    typedef typename Value<TString>::Type TChar;
    typedef typename Size<TStringSet>::Type TSize;

    TSize const RADIX = ValueSize<TChar>::VALUE;
    TSize bounds[RADIX];

    radix(front, front, back, static_cast<TSize>(0), bounds, RADIX - 1);
}

} // namespace seqan
链接地址: http://www.djcxy.com/p/1910.html

上一篇: 我如何排序一个NSMutableArray中的自定义对象?

下一篇: 放置基数排序