mostly data structure to compress and search source code

Suppose you have a lot of source code (like 50GB+) in popular languages (Java, C, C++, etc).

The project needs are:

  • compressing source code to reduce disk use and disk I/O

  • indexing it in such way that particular source file can be extracted from the compressed source without decompressing the whole thing

  • compression time for the whole codebase is not important

  • search and retrieval time (and memory use when searching and retrieving) are important

  • This SO answer contains potential answers: What are the lesser known but useful data structures?

    However, this is just a list of potentials - I do not know how those structures actually evaluate against requirements listed above.

    Question: what are data structures (and their implementations) that would perform well according to the aforementioned requirements?


    The main data-structure used for searching is the inverted list. Fortunately, you don't need to implement it yourself. Lucene is a widely used search tool which works with inverted lists internally.

    Using Lucene you can create a document with multiple fields. The idea is that some of these fields will be searchable with standard keyword-type queries.

    I've implemented a source code search utility which I'll now describe briefly in the following paragraphs. The entire source code itself is stored as a non-indexable field named "code" (you can modify the source to store a compressed version).

    For the retrieval part, note that the keywords that you are going to use for the search could be names of function, classes, packages or variables. They could also be words from the comments and so on. In my implementation, I extracted these information from using a Java annotated syntax tree (AST). You could do the same for other languages as well by making use of an appropriate parser to construct an AST.

    Another possibility is the query-by-example (QBE) paradigm, where you could use a small snippet of code to search approximately similar snippets from your indexed code-base. This is particularly helpful for detecting source-code reuse and plagiarism, the main purpose for which I developed the tool.

    The project page is here. I call it the YASOCS (Yet Another SOurce Code Searcher).

    The search is very fast since it uses the inverted list. You could also use Luke (an open source Lucene index visualizer) to "see" the index yourself and execute test queries using the interface.

    链接地址: http://www.djcxy.com/p/39858.html

    上一篇: 用lambda表达式实现具有两个抽象方法的接口

    下一篇: 主要是数据结构来压缩和搜索源代码