What are probabilistic data structures?

2018-05-31 09:06:21

I have read about data structures like bloom filters and skip lists.

What are the common characteristics of probabilistic data structures and what are they used for?

There are probably a lot of different (and good) answers, but in my humble opinion, the common characteristics of probabilistic data structures is that they provide you with approximate, not precise answer.

How many items are here? About 1523425 with probability of 99%

Update: Quick search produced link to decent article on the issue:

https://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

Probabilistic data structures can't give you a definite answer, instead they provide you with a reasonable approximation of the answer and a way to approximate this estimation. They are extremely useful for big data and streaming application because they allow to dramatically decrease the amount of memory needed (in comparison to data structures that give you exact answers).

In majority of the cases these data structures use hash functions to randomize the items. Because they ignore collisions they keep the size constant, but this is also a reason why they can't give you exact values. The advantages they bring:

they use small amount of memory (you can control how much)

they can be easily parallelizable (hashes are independent)

they have constant query time (not even amortized constant like in dictionary)

Frequently used probabilistic data structures are:

bloom filters

count-min sketch

hyperLogLog

There is a list of probabilistic data structures in wikipedia for your reference: https://en.wikipedia.org/wiki/Category:Probabilistic_data_structures

There are different definitions about what "probabilistic data structure" is. IMHO, probabilistic data structure means that the data structure uses some randomized algorithm or takes advantage of some probabilistic characteristics internally, but they don't have to behave probabilistically or un-deterministically from the data structure user's perspective.

There are many "probabilistic data structures" with probabilistically behavior such as the bloom filter and HyperLogLog mentioned by the other answers.

At the same time, there are other "probabilistic data structures" with determined behavior (from a user's perspective) such as skip list. For skip list, users can use it similarly as a balanced binary search tree but is implemented with some probability related idea internally. And according to skip list's author William Pugh:

Skip lists are a probabilistic data structure that seem likely to supplant balanced trees as the implementation method of choice for many applications. Skip list algorithms have the same asymptotic expected time bounds as balanced trees and are simpler, faster and use less space.

链接地址: http://www.djcxy.com/p/6788.html

上一篇: 斐波那契数列的计算复杂性

下一篇: 什么是概率数据结构？