Data structure and algorithm for representing/allocating free space in a file

I have a file with "holes" in it and want to fill them with data; I also need to be able to free "used" space, which creates new holes.

I was thinking of using a bi-map that maps between offset and length. However, I am not sure that is the best approach if there are really tiny gaps in the file. A bitmap would work, but I don't know how to switch to it dynamically for only certain regions of the file. Perhaps some sort of radix tree is the way to go?

For what it's worth, I am up to speed on modern file system design (ZFS, HFS+, NTFS, XFS, ext...) and I find their solutions woefully inadequate.

My goals are to have pretty good space savings (hence the concern about small fragments). If I didn't care about that, I would just go for two splay trees: one sorted by offset, and the other sorted by length with ties broken by offset. Note that this gives you amortized log(n) for all operations, with a working-set time of log(m)... pretty darn good. But, as previously mentioned, it does not handle high fragmentation well.
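For concreteness, here is a rough sketch of that two-tree free-space map, with std::map/std::set standing in for the splay trees (same ordered interface); the best-fit policy and the coalescing on free are illustrative assumptions, not a fixed design:

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <optional>
#include <set>
#include <utility>

// Two-index free-extent map: one view ordered by offset (for coalescing),
// one ordered by (length, offset) (for best-fit allocation).
class FreeSpaceMap {
    std::map<uint64_t, uint64_t> by_offset_;             // offset -> length
    std::set<std::pair<uint64_t, uint64_t>> by_length_;  // (length, offset)

public:
    // Best-fit: return the offset of a free extent of at least `len` bytes,
    // splitting off the unused tail, or nullopt if nothing fits.
    std::optional<uint64_t> allocate(uint64_t len) {
        auto it = by_length_.lower_bound({len, 0});
        if (it == by_length_.end()) return std::nullopt;
        auto [extent_len, off] = *it;
        by_length_.erase(it);
        by_offset_.erase(off);
        if (extent_len > len) {                           // keep the remainder free
            by_offset_[off + len] = extent_len - len;
            by_length_.insert({extent_len - len, off + len});
        }
        return off;
    }

    // Free `len` bytes at `off`, merging with adjacent free extents so that
    // tiny fragments collapse back into larger ones.
    void release(uint64_t off, uint64_t len) {
        auto next = by_offset_.lower_bound(off);
        if (next != by_offset_.begin()) {
            auto prev = std::prev(next);
            if (prev->first + prev->second == off) {      // merge with predecessor
                off = prev->first;
                len += prev->second;
                by_length_.erase({prev->second, prev->first});
                by_offset_.erase(prev);
            }
        }
        if (next != by_offset_.end() && off + len == next->first) {
            len += next->second;                          // merge with successor
            by_length_.erase({next->second, next->first});
            by_offset_.erase(next);
        }
        by_offset_[off] = len;
        by_length_.insert({len, off});
    }
};
```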


I have shipped commercial software that does just that. In the latest iteration, we ended up sorting blocks of the file into "type" and "index," so you could read or write "the third block of type foo." The file ended up being structured as:

1) File header. Points at master type list.
2) Data. Each block has a header with type, index, logical size, and padded size.
3) Arrays of (offset, size) tuples for each given type.
4) Array of (type, offset, count) that keeps track of the types.
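One plausible way to lay those records out on disk is sketched below; the field names and widths are illustrative, not the exact shipped format:

```cpp
#include <cstdint>

// Assumed field names and widths; packed so the structs mirror raw disk bytes.
#pragma pack(push, 1)
struct FileHeader {
    char     magic[8];            // signature / format version
    uint64_t master_index_off;    // where the (type, offset, count) array lives
    uint32_t master_index_count;  // number of types
};

struct BlockHeader {              // precedes every data block in the file
    uint32_t type;
    uint32_t index;               // "the third block of type foo" -> index == 2
    uint64_t logical_size;        // bytes actually used by the payload
    uint64_t padded_size;         // bytes reserved on disk (>= logical_size)
};

struct TypeIndexEntry {           // one per block, in the per-type index array
    uint64_t offset;              // file offset of the block's data
    uint64_t size;
};

struct MasterIndexEntry {         // one per type, in the "index index"
    uint32_t type;
    uint64_t offset;              // file offset of that type's TypeIndexEntry array
    uint32_t count;               // number of entries in that array
};
#pragma pack(pop)
```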

We defined it so that each block was an atomic unit. You started writing a new block, and finished writing that before starting anything else. You could also "set" the contents of a block. Starting a new block always appended at the end of the file, so you could append as much as you wanted without fragmenting the block. "Setting" a block could re-use an empty block.

When you opened the file, we loaded all the indices into RAM. When you flushed or closed a file, we re-wrote each index that changed, at the end of the file, then re-wrote the index index at the end of the file, then updated the header at the front. This means that changes to the file were all atomic -- either you commit to the point where the header is updated, or you don't. (Some systems use two copies of the header 8 kB apart to preserve headers even if a disk sector goes bad; we didn't take it that far)
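In outline, the flush looked something like the sketch below (simplified to raw POSIX I/O; the header here is reduced to just the master-index offset, and the fsync placement shows where the durability barriers belong):

```cpp
#include <cstdint>
#include <vector>
#include <unistd.h>

// Commit ordering only: append the changed type indices, append the master
// index ("index index"), make the appends durable, then overwrite the header
// at offset 0 and make that durable. A crash before the final header write
// leaves the previous header, and therefore the previous consistent state.
bool commit(int fd,
            const std::vector<std::vector<uint8_t>>& changed_indices,
            const std::vector<uint8_t>& master_index) {
    if (lseek(fd, 0, SEEK_END) < 0) return false;

    for (const auto& idx : changed_indices)               // 1. changed type indices
        if (write(fd, idx.data(), idx.size()) != (ssize_t)idx.size()) return false;

    off_t master_off = lseek(fd, 0, SEEK_CUR);             // 2. the index index
    if (master_off < 0) return false;
    if (write(fd, master_index.data(), master_index.size())
            != (ssize_t)master_index.size()) return false;
    if (fsync(fd) != 0) return false;                      // appended data is durable

    uint64_t header = (uint64_t)master_off;                // 3. header = master offset
    if (pwrite(fd, &header, sizeof header, 0) != (ssize_t)sizeof header) return false;
    return fsync(fd) == 0;                                 // commit point
}
```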

One of the block "types" was "free block." When re-writing changed indices, and when replacing the contents of a block, the old space on disk was merged into the free list kept in the array of free blocks. Adjacent free blocks were merged into a single bigger block. Free blocks were re-used when you "set content" or for updated type-block indices, but not for the index index, which was always written last.

Because the indices were always kept in memory, working with an open file was really fast -- typically just a single read to get the data of a single block (or to get a handle to a block for streaming). Opening and closing were a little more complex, as they needed to load and flush the indices. Had that become a problem, we could have loaded the secondary type indices on demand rather than up-front to amortize the cost, but it never was a problem for us.

Top priority for persistent (on disk) storage: Robustness! Do not lose data even if the computer loses power while you're working with the file! Second priority for on-disk storage: Do not do more I/O than necessary! Seeks are expensive. On Flash drives, each individual I/O is expensive, and writes are doubly so. Try to align and batch I/O. Using something like malloc() for on-disk storage is generally not great, because it does too many seeks. This is also a reason I don't like memory mapped files much -- people tend to treat them like RAM, and then the I/O pattern becomes very expensive.


For memory management I am a fan of the BiBOP* approach, which is normally efficient at managing fragmentation.

The idea is to segregate data based on their size. This way, within a "bag" you only have "pages" of small blocks with identical sizes:

  • no need to store the size explicitly; it's known from the bag you're in
  • no "real" fragmentation within a bag
  • the bag keeps a simple free-list of the available pages, and each page keeps a free-list of its available storage units as an overlay over those units

You need an index that maps a size to its corresponding bag.

You also need special treatment for "out-of-norm" requests (i.e. requests that ask for an allocation greater than the page size).

This storage is extremely space efficient, especially for small objects, because the overhead is not per-object. There is one drawback, however: you can end up with "almost empty" pages that still contain one or two occupied storage units.

This can be alleviated if you have the ability to "move" existing objects, which effectively lets you merge pages (see the sketch below).

(*) BiBOP: Big Bag Of Pages
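A minimal in-memory sketch of the scheme, assuming 4 KB pages, power-of-two size classes starting at 16 bytes, and a free-list threaded through unused slots; mapping a pointer back to its page (needed to release a slot from outside) is omitted for brevity:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// 4 KB pages; each page threads a free-list through its unused slots, so the
// per-object overhead is zero.
constexpr std::size_t kPageSize = 4096;

struct Page {
    std::vector<std::uint8_t> bytes = std::vector<std::uint8_t>(kPageSize);
    std::size_t slot_size;
    std::size_t bump = 0;            // next never-used slot
    void*       free_list = nullptr; // previously released slots

    explicit Page(std::size_t s) : slot_size(s) {}

    void* alloc() {
        if (free_list) {                                   // reuse a released slot
            void* p = free_list;
            free_list = *static_cast<void**>(p);
            return p;
        }
        if (bump + slot_size > kPageSize) return nullptr;  // page is full
        void* p = bytes.data() + bump;
        bump += slot_size;
        return p;
    }
    void release(void* p) {                                // push slot onto free list
        *static_cast<void**>(p) = free_list;
        free_list = p;
    }
};

struct Bag {                          // all pages of one size class
    std::size_t slot_size;
    std::vector<Page*> pages;

    explicit Bag(std::size_t s) : slot_size(s) {}
    void* alloc() {
        for (Page* pg : pages)
            if (void* p = pg->alloc()) return p;
        pages.push_back(new Page(slot_size));              // grow the bag
        return pages.back()->alloc();
    }
};

class BibopAllocator {
    std::map<std::size_t, Bag> bags_;                      // size class -> bag
public:
    void* alloc(std::size_t n) {
        std::size_t cls = 16;                              // smallest size class
        while (cls < n) cls <<= 1;                         // round up to power of two
        if (cls > kPageSize) return ::operator new(n);     // "out-of-norm" request
        return bags_.try_emplace(cls, cls).first->second.alloc();
    }
    // Releasing needs a way to map a pointer back to its page (e.g. by aligning
    // pages and masking the pointer); omitted here for brevity.
};
```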


I would recommend making a customized file system (it might contain just one file, of course) based on FUSE. There are a lot of existing FUSE projects you can build on - I recommend choosing one of the simplest, even if unrelated, so it is easy to learn from.

Which algorithm and data structure to choose depends highly on your needs. It could be a map, a list, or a file split into chunks with on-the-fly compression/decompression.

The data structures you propose are good ideas. As you clearly see, there is a trade-off: fragmentation vs. compaction.

On one side - best compaction, highest fragmentation - splay trees and many other kinds of trees.

On the other side - lowest fragmentation, worst compaction - a linked list.

In between there are B-trees and others.

As I understand it, you stated space-saving as the priority, while still taking care of performance.

I would recommend a mixed data structure in order to meet all the requirements:

  • a kind of list of contiguous blocks of data
  • a kind of tree for current "add/remove" operations
  • when data is requested on demand, allocate from the tree; when data is deleted, keep track of what is "deleted" using the tree as well
  • mixing: during each operation (or in idle moments), do "step by step" defragmentation, applying the changes kept in the tree to the contiguous blocks and moving them slowly
  • This solution gives you fast response on demand, while "optimising" things as the file is used (for example, for each read of 10 MB of data, defragment 1 MB) or in idle moments. A rough sketch follows.
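A very rough sketch of that hybrid, assuming a compact vector of mostly-merged extents plus an ordered map of recent frees, with a bounded defrag step folding the map back into the vector; all names and the first-fit policy are assumptions:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

struct Extent { uint64_t off, len; };

// Hybrid free-space map: long-lived contiguous extents plus a tree of recent
// frees; defrag_step() folds a bounded number of recent entries back into the
// extent list, merging with an adjacent extent when it can.
class HybridFreeMap {
    std::vector<Extent> contiguous_;        // long-lived, mostly-merged extents
    std::map<uint64_t, uint64_t> recent_;   // recently freed: offset -> length

public:
    void release(uint64_t off, uint64_t len) { recent_[off] = len; }

    // First-fit allocation: try the recent-delta tree, then the extent list.
    std::optional<uint64_t> allocate(uint64_t len) {
        for (auto it = recent_.begin(); it != recent_.end(); ++it)
            if (it->second >= len) {
                uint64_t off = it->first, rest = it->second - len;
                recent_.erase(it);
                if (rest) recent_[off + len] = rest;
                return off;
            }
        for (Extent& e : contiguous_)
            if (e.len >= len) {
                uint64_t off = e.off;
                e.off += len;
                e.len -= len;                // may leave a zero-length stub
                return off;
            }
        return std::nullopt;
    }

    // Incremental defragmentation: call with a small budget on every read/write
    // (or from an idle task) so the merging cost is spread over normal use.
    void defrag_step(std::size_t budget = 4) {
        while (budget-- && !recent_.empty()) {
            auto [off, len] = *recent_.begin();
            recent_.erase(recent_.begin());
            bool merged = false;
            for (Extent& e : contiguous_) {
                if (e.off + e.len == off) { e.len += len; merged = true; break; }
                if (off + len == e.off)   { e.off = off; e.len += len; merged = true; break; }
            }
            if (!merged) contiguous_.push_back({off, len});
        }
    }
};
```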
