Suffix tree of large (10 MB) text taking excessive memory

I implemented (see the code below) the absolute minimal generalized suffix tree building algorithm. I wrote a unit test and it seems to work as expected (it finds the right substring at the right position). But the tree is impractically large. Question: did I make a mistake somewhere, or are suffix trees in this basic form only usable for very short texts?

Statistics

I want to use this to search large bodies of text: several 15-20 MB text files, or e.g. 40'000 strings of ~60 characters each.

When I build the suffix tree for the 40'000 strings (2.5 MB of text; it's a list of firm names), the tree takes 400 MB. I could probably optimize it about 4x, but even then it would be more than 40 bytes per character of the original text. Is that normal? The estimates in the literature suggest to me that this is high.

Looking at the average branching factors per level of the tree, they are 80, 40, 8, 3, 2, 1, 1, 1, ..., i.e. only the first 3-5 levels of the tree actually branch. I'd be much better off building only 3-5 levels and keeping whole suffixes as text in the leaves. Everything below the 5th level is basically a linked list of characters. Is that expected behavior? It makes intuitive sense: there aren't many shared substrings of 6+ characters between firm names.
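
A minimal sketch of how those per-level averages could be computed from the getDepthCounts() helper in the code below (printBranchingFactors is just an illustrative name, not part of the posted code):

#include <stdio.h>
#include <vector>

// `tree` is assumed to be a fully built suffix tree (the Root from the code below).
void printBranchingFactors(const Vertex &tree)
{
    std::vector<unsigned> depth; // depth[k] = number of nodes at level k
    tree.getDepthCounts(depth);
    for (size_t level = 0; level + 1 < depth.size(); level++)
        printf("level %zu: average branching factor %.1f\n",
               level, double(depth[level + 1]) / double(depth[level]));
}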

In a further experiment with a 15 MB text, I run out of 2 GB of memory after adding the first 10'000 suffixes (longest to shortest). There are almost no repeating substrings, so the tree cannot reuse much. I can absolutely see how a suffix array is practically useful: it takes a fixed 2-4 bytes per character and requires only ~24 string compares per search in 20 MB of text. I can't possibly see how a suffix tree with as many vertices as there are unique substrings in the text can fit in memory. It's O(n^2) nodes for a string with all unique characters, and it still seems very superlinear for English text. How can a suffix tree work for large texts? The papers and questions I read seem to imply it's supposed to be usable, hence my questions here.
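
For comparison, a minimal sketch of the suffix-array approach mentioned above (naive O(n^2 log n) construction plus a binary-search lookup); this isn't production code, just an illustration of the "few bytes per character, ~log2(n) compares per query" point. Function names are mine; a real implementation would use 32-bit indices for texts under 4 GB to get down to 4 bytes per character.

#include <algorithm>
#include <string>
#include <vector>

std::vector<size_t> buildSuffixArray(const std::string &text)
{
    std::vector<size_t> sa(text.size());
    for (size_t i = 0; i < sa.size(); i++)
        sa[i] = i;
    // Naive sort of suffix start positions; real implementations use
    // O(n log n) or O(n) construction algorithms.
    std::sort(sa.begin(), sa.end(), [&](size_t a, size_t b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// Returns the start position of one occurrence of pat in text, or std::string::npos.
size_t findOccurrence(const std::string &text, const std::vector<size_t> &sa, const std::string &pat)
{
    size_t lo = 0, hi = sa.size();
    while (lo < hi) // ~24 iterations for 20 MB of text
    {
        size_t mid = (lo + hi) / 2;
        if (text.compare(sa[mid], pat.size(), pat) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < sa.size() && text.compare(sa[lo], pat.size(), pat) == 0)
        return sa[lo];
    return std::string::npos;
}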

Questions

Did I make a mistake somewhere that makes the tree bigger than it should be?

Is it correct that the manner of building the suffix tree should have no bearing on the final tree shape?

Mine is not Ukkonen's algorithm but just brute-force addition of all suffixes to the tree, to simplify the code and data structure (no suffix link field) and to compare performance to Ukkonen's later. The method of building should only affect the speed of construction, not the size or shape of the tree. Either way, building the tree is incremental, so there's no intermediate result that's bigger than the final tree.

Here's the code:

#include <vector>
#include <string.h> // strlen
#include <assert.h>
class Node
{
public:
    char c; // 0 is a terminator; terminators have no children
    Node(char _c) { c = _c; }
};

class Terminator 
{
public:
    Terminator(int i, int p ) { id = i; pos = p; }
    bool operator == (const Terminator &that)const { return id == that.id && pos == that.pos; }
    int id; // id of the string
    int pos; //position of this substring in the string
};


class Vertex : public Node 
{
public:
    static size_t s_nCount;
    std::vector< Vertex* > children; // interior tree nodes; char != 0
    std::vector< Terminator >terminators;
    Vertex(char c) :Node(c) { s_nCount++; }
    ~Vertex() { for (Vertex*v : children) delete v; }
    //void* operator new  (size_t count) { return g_GstAlloc.Alloc(count); }
    //void operator delete(void *p) { g_GstAlloc.Free(p); }
    void getDepthCounts(std::vector<unsigned> &depth, size_t nLevel = 0)const
    {
        if (depth.size() <= nLevel)
            depth.resize(nLevel + 1);
        depth[nLevel]++;
        for (Vertex*v : children)
            v->getDepthCounts(depth,nLevel + 1);
    }
    Vertex *getOrCreateChildVertex(char c )
    {
        Vertex *out = getChild(c);
        if (!out)
        {
            out = new Vertex(c);
            children.push_back(out);
        }
        return out;
    }

    void getTerminators(std::vector<Terminator> &out, size_t limit )
    {
        if (out.size() >= limit)
            return;
        out.insert(out.end(), terminators.begin(), terminators.end());
        for (Vertex* c: children) {
            c->getTerminators(out, limit);
            if (out.size() >= limit)
                break;
        }
    }
    Vertex *getChild(char c) 
    {
        for (Vertex *p : children)
            if (p->c == c)
                return p;
        return nullptr;
    }
    size_t memSize()const
    {
        size_t out = sizeof(*this) + terminators.size() * sizeof(terminators[0]);
        for (Vertex*v : children)
            out += sizeof(v) + v->memSize();
        return out;
    }
};
size_t Vertex::s_nCount = 0; // out-of-class definition required for the static counter

class Root : public Vertex 
{
public:
    Root():Vertex(0) {  }
    void appendString(const char *str, int id )
    {
        for (size_t len = strlen(str), suffix = len; suffix-- > 0;)
        {
            Vertex* parent = this;
            for (size_t pos = suffix; pos < len; pos++)
            {
                parent = parent->getOrCreateChildVertex(str[pos]);
            }
            parent->terminators.push_back(Terminator(id, (int)suffix));
        }
    }

    void findSubstr(std::vector<Terminator> &out, const char *substr, size_t limit )
    {
        Vertex *parent = this;
        for (size_t len = strlen( substr ), i = 0; i < len; i++)
        {
            parent = parent->getChild(substr[i]);
            if (!parent)
                return;
        }
        parent->getTerminators(out, limit);
    }
};
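
For completeness, a minimal usage sketch of the classes above (not part of the original post; the strings are made up):

#include <stdio.h>

int main()
{
    Root root;
    root.appendString("Acme Widgets Ltd", 0);
    root.appendString("Widget World Inc", 1);

    std::vector<Terminator> hits;
    root.findSubstr(hits, "Widget", 10); // collect up to 10 occurrences
    for (const Terminator &t : hits)
        printf("string %d, position %d\n", t.id, t.pos);

    printf("node count: %zu, approx. tree size: %zu bytes\n", Vertex::s_nCount, root.memSize());
    return 0;
}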

Prompted by Bo Persson's comment, I re-read the Wikipedia article on suffix trees and it finally clicked. I may still misunderstand suffix trees, but this particular explosion of memory is caused by my building an expanded tree (let's call it a character tree) with a single character at every node. That's an incorrect format.

Ukkonen's algorithm, like the other suffix tree building algorithms, builds the tree in O(n) time, which contains a hint: a string ABCD...XYZ has a character tree with O(n^2) nodes, which is obviously impossible to build in O(n) time.

A proper suffix tree has nodes labeled with whole substrings (in the Wikipedia figure, BANANA$ is a single node rather than 6 character nodes plus a terminator). Each node takes fixed memory (the index of its first character and a length, for example).
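
A sketch of what that compressed node format could look like (the struct and field names are mine, not from the code above): each node stores a (start, length) slice of the original text instead of a single character, so an unbranched run of characters costs one node regardless of its length.

#include <vector>

struct EdgeNode
{
    size_t start;                    // index of the first character of this node's substring in the text
    size_t length;                   // number of characters the node covers
    std::vector<EdgeNode*> children; // empty for leaves, i.e. suffix ends
};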

Ukkonen's algorithm's insight is in optimizing the case of a string like AAAAAA...AAAAA. Such a string has O(n) suffix tree nodes (and also O(n) "character tree" nodes), but naive building takes O(n^2) time because you have to walk down the growing chain of nodes on every step. With Ukkonen's suffix links, adding each suffix takes O(1) operations (it just adds a node at the end and, after following a suffix link once, stops).
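
A sketch of the extra field Ukkonen's algorithm relies on, building on the EdgeNode sketch above (again, illustrative names only): the suffix link of a node representing some string points to the node representing the same string with its first character removed, so the builder can jump to the insertion point for the next shorter suffix instead of re-walking from the root.

#include <vector>

struct UkkonenNode
{
    size_t start, length;              // substring of the text, as in EdgeNode above
    std::vector<UkkonenNode*> children;
    UkkonenNode *suffixLink = nullptr; // node spelling this node's path minus its first character
};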

In the case of an ABCD...XYZ string, the proper suffix tree still has O(n) nodes. E.g. the suffix FGH...XYZ is only one node.

In my code, that suffix generates 21 separate nodes. That's what creates the O(n^2) memory consumption: basically my failure to understand that a suffix tree node holds a substring rather than a single character.
