How to compress many strings across a data structure?

I have a 500GB collection of XML documents that I'm indexing. I'm currently only able to index 6GB of this collection with 32GB of RAM.

My index structure is a HashMap<String, PatriciaTrie<String, Integer>>, where the first string represents a term and the second string is of the format filepath+XPath, with the final integer representing the number of occurrences.

I used a trie because I need the data sorted and so that shared prefixes are stored only once. It helped a little with compression, but it wasn't enough.

The total collection of filepath+XPath strings is somewhere between 1TB and 4TB within this data structure. I need to be able to compress this data structure entirely into memory. The target machine has 256GB RAM and 16 CPU cores. Less memory has multiple added benefits (such as reducing cold start time). Index time isn't such a big deal.

The XPaths represent about 250 total node types.

The approach I'm currently working on builds a Huffman table for each sequence of two tags, based on the tags that can possibly occur next. Often this cuts the options down to about 4 or 5, which allows the XPath to be encoded into a much shorter bitstring that can then be packed into bytes.
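Roughly, the idea is something like this (a simplified sketch, not my actual code; class and method names are just illustrative, and it assumes I've already counted which tags follow each two-tag context):

    import java.util.*;

    // Per-context Huffman tables: the key is the previous two tags
    // (e.g. "chapter/section"); for each context we build a Huffman code
    // over the tags observed to follow it.
    class ContextHuffman {

        static final class Node implements Comparable<Node> {
            final String tag;          // leaf tag, or null for internal nodes
            final long freq;
            final Node left, right;
            Node(String tag, long freq, Node left, Node right) {
                this.tag = tag; this.freq = freq; this.left = left; this.right = right;
            }
            public int compareTo(Node o) { return Long.compare(freq, o.freq); }
        }

        // context ("tagA/tagB") -> next tag -> bit string such as "010"
        static Map<String, Map<String, String>> buildTables(
                Map<String, Map<String, Long>> contextFreqs) {
            Map<String, Map<String, String>> tables = new HashMap<>();
            for (Map.Entry<String, Map<String, Long>> ctx : contextFreqs.entrySet()) {
                PriorityQueue<Node> pq = new PriorityQueue<>();
                for (Map.Entry<String, Long> t : ctx.getValue().entrySet()) {
                    pq.add(new Node(t.getKey(), t.getValue(), null, null));
                }
                while (pq.size() > 1) {                    // standard Huffman merge
                    Node a = pq.poll(), b = pq.poll();
                    pq.add(new Node(null, a.freq + b.freq, a, b));
                }
                Map<String, String> codes = new HashMap<>();
                assignCodes(pq.poll(), "", codes);
                tables.put(ctx.getKey(), codes);
            }
            return tables;
        }

        static void assignCodes(Node n, String prefix, Map<String, String> codes) {
            if (n == null) return;
            if (n.tag != null) {                           // leaf: record its code
                codes.put(n.tag, prefix.isEmpty() ? "0" : prefix);
                return;
            }
            assignCodes(n.left, prefix + "0", codes);
            assignCodes(n.right, prefix + "1", codes);
        }
    }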

The strings are typically 40-600 bytes (UTF-8). I believe this should reduce everything after the filepath prefix (the first 40 characters, which the trie already compresses) to at most 12 bytes for the structure (the deepest point in the tree is about 12 nodes deep, and each node takes at worst 1 character to represent) plus 12 bytes for the indexes (variable-byte encoded, with very few elements having indexes above 256), producing strings that are usually in the range of 40-64 bytes.

I think this is a good approach, but I think I may be missing something.

  • Is there a better approach for compressing this data structure or the data that goes into it?
  • How do people usually compress many strings across the same data structure?
  • Is there any existing solution that compresses many strings independently based on the whole collection?
  • After the strings are in the data structure like this, are there any good techniques for compressing the tries based on the structure shared between them?

  • I think your biggest problem here is that you're storing too much data for each term. You don't say how many unique terms you have or how many individual files, but I'll give some example numbers.

    Say you have 200,000 unique terms across 200 different files. So every unique term carries the weight of at least one file path, or 40 bytes. And that's before you start indexing anything.

    You should be able to compress this data into a table of filepath+XPath strings, and a list of terms, each of which contains references to entries in that table. So, for example, you might have:

    Path table:

    index   Path
      1   file+xpath1
      2   file+xpath2
      3   file+xpath3
      ...
    999   file+xpath999
    

    Terms

    term  references
    foo   1, 19, 27, 33, 297
    bar   99, 864, 865
    ...
    
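    To make that layout concrete, a minimal sketch in Java (the class and field names here are illustrative, not an existing API) might look like:

    import java.util.*;

    // One global table of filepath+XPath strings, plus per-term postings
    // that are just integer indexes into that table.
    class PathTableIndex {
        final List<String> paths = new ArrayList<>();                 // index -> "file+xpath"
        final Map<String, Integer> pathIds = new HashMap<>();         // "file+xpath" -> index
        final Map<String, List<Integer>> postings = new HashMap<>();  // term -> path indexes

        int internPath(String fileAndXpath) {
            return pathIds.computeIfAbsent(fileAndXpath, p -> {
                paths.add(p);
                return paths.size() - 1;
            });
        }

        void addOccurrence(String term, String fileAndXpath) {
            postings.computeIfAbsent(term, t -> new ArrayList<>())
                    .add(internPath(fileAndXpath));
        }
    }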

    Now, your paths table is probably still way too large. The first thing you can do is to build a files table and make the first part of each paths entry an index into the files table. So you end up with:

    Files

      1  file1.xml
      2  file2.xml
     ...
    999  file999.xml
    

    And then your paths become:

      1  1,xpathA
      2  1,xpathB
      3  2,xpathQ
      ...
    

    If you need more compression after that, build a string table that contains the XPath terms, and your paths entries become a series of indexes into that table. You have to be careful here, though, because allocation overhead for arrays or lists will make short lists very expensive. If you go this route, you'll want to encode the paths list as one big binary array and index into it. For example:

    Words list

    1 the
    2 quick
    3 brown
    4 fox
    

    Paths

    index  path
    0      1(index of file),2(quick),4(fox),-1(terminator)
    4      3(index of file),3(brown),-1(terminator)
    7      etc . . .
    

    The Paths table is just a big array that would look like this:

    1,2,4,-1,3,3,-1,...
    
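    A hedged sketch of that flat-array encoding (again, the names are mine, not a library API): every path is written as a run of ints, first the file id, then the word-list id of each XPath segment, terminated by -1, and a path is referenced by its start offset in the array.

    import java.util.*;

    class FlatPaths {
        private int[] data = new int[16];
        private int size = 0;

        // Appends one path and returns its start offset (the "index" column above).
        int addPath(int fileId, int[] segmentIds) {
            int start = size;
            append(fileId);
            for (int id : segmentIds) append(id);
            append(-1);                                   // terminator
            return start;
        }

        // Reads the path stored at the given offset back into a list of ids.
        List<Integer> readPath(int offset) {
            List<Integer> ids = new ArrayList<>();
            for (int i = offset; data[i] != -1; i++) ids.add(data[i]);
            return ids;
        }

        private void append(int v) {
            if (size == data.length) data = Arrays.copyOf(data, size * 2);
            data[size++] = v;
        }
    }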

    This minimizes data storage cost because no string is ever stored more than once. All you have is string tables and references to those strings. The amount of space it takes will be roughly the sum of:

    Combined length of all file names
    Combined length of all path segment terms
    (number of paths) * (average path length) * (size of integer index)
    (number of terms) * (average number of references per term) * (size of integer index)
    
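    As a purely hypothetical sanity check (all of these counts are assumed, not taken from your data): with 1,000,000 unique paths averaging 10 ints each (file id + segments + terminator), 200,000 terms averaging 50 references each, and 4-byte indexes, that's roughly 1,000,000 × 10 × 4 ≈ 40 MB for the paths array and 200,000 × 50 × 4 ≈ 40 MB for the references, plus the comparatively tiny string tables.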

    Building this in memory might be possible. It's hard to say without knowing how many individual terms you have. You'll need dictionaries for the file names, the paths, and the individual path segments if you use the words list. But it can all be done in a single pass if you have the memory.

    If you don't have enough memory to build the whole index at once, you can load the file names and maintain the paths table in memory. As you find each term in a file, write it to disk along with its path reference. You end up with a disk file that looks like:

    term, path reference
    term, path reference
    ...
    

    Use an external sort program to sort by term, and then go through and combine duplicates. When you're done you end up with a file that contains:

    File names table
    Path segments table
    Paths
    terms
    
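    The "combine duplicates" pass over the sorted file can be a simple streaming merge. A rough sketch, assuming a tab-separated "term<TAB>pathReference" line format and an external sort done beforehand (for example with a command-line sort tool):

    import java.io.*;
    import java.util.*;

    class CombineSortedTerms {
        public static void main(String[] args) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
                 PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(args[1])))) {
                String currentTerm = null;
                List<String> refs = new ArrayList<>();
                for (String line; (line = in.readLine()) != null; ) {
                    int tab = line.indexOf('\t');
                    String term = line.substring(0, tab);
                    String ref = line.substring(tab + 1);
                    if (!term.equals(currentTerm)) {
                        flush(out, currentTerm, refs);    // emit the previous term's postings
                        currentTerm = term;
                        refs.clear();
                    }
                    refs.add(ref);
                }
                flush(out, currentTerm, refs);            // emit the last term
            }
        }

        private static void flush(PrintWriter out, String term, List<String> refs) {
            if (term != null) out.println(term + "\t" + String.join(",", refs));
        }
    }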

    Lookup is really easy. Find the term, look up each reference in the paths table, and decode the path by indexing into the file names and path segments tables.
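    Decoding at lookup time is just a few array indexes. A hedged sketch, assuming the flat paths array and string tables from above (field names are illustrative):

    import java.util.*;

    class TermLookup {
        Map<String, int[]> postings;   // term -> start offsets of its paths
        int[] pathData;                // file id, segment ids..., -1, file id, ...
        String[] files;                // file id -> file name
        String[] segments;             // segment id -> XPath step

        // Returns the decoded filepath+XPath strings for one term.
        List<String> find(String term) {
            List<String> results = new ArrayList<>();
            for (int offset : postings.getOrDefault(term, new int[0])) {
                StringBuilder path = new StringBuilder(files[pathData[offset]]);
                for (int i = offset + 1; pathData[i] != -1; i++) {
                    path.append('/').append(segments[pathData[i]]);
                }
                results.add(path.toString());
            }
            return results;
        }
    }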

    I used something like this a few years back and it worked quite well. You should be able to write a program that analyzes your data to come up with the numbers (unique paths, number of file names, average number of references per term, etc.). From there, you can easily determine if using this technique will work for you.
