Algorithm for finding basis of a set of bitstrings?

2018-06-03 04:38:49

This is for a diff utility I'm writing in C++.

I have a list of n character-sets {"a", "abc", "abcde", "bcd", "de"} (taken from an alphabet of k=5 different letters). I need a way to observe that the entire list can be constructed by disjunctions of the character-sets {"a", "bc", "d", "e"}. That is, "b" and "c" are linearly dependent, and every other pair of letters is independent.

In the bit-twiddling version, the character-sets above are represented as {10000, 11100, 11111, 01110, 00011}, and I need a way to observe that they can all be constructed by ORing together bitstrings from the smaller set {10000, 01100, 00010, 00001}.

In other words, I believe I'm looking for a "discrete basis" of a set of n different bit-vectors in {0,1}k. This paper claims the general problem is NP-complete... but luckily I'm only looking for a solution to small cases (k < 32).

I can think of really stupid algorithms for generating the basis. For example: For each of the k2 pairs of letters, try to demonstrate (by an O(n) search) that they're dependent. But I really feel like there's an efficient bit-twiddling algorithm that I just haven't stumbled upon yet. Does anyone know it?

EDIT: I ended up not really needing a solution to this problem after all. But I'd still like to know if there is a simple bit-twiddling solution.

I'm thinking a disjoint set data structure, like union find turned on it's head (rather than combining nodes, we split them).

Algorithm:

Create an array main where you assign all the positions to the same group, then:

for each bitstring curr
  for each position i
    if (curr[i] == 1)
      // max of main can be stored for constant time access
      main[i] += max of main from previous iteration

Then all the distinct numbers in main are your different sets (possibly using the actual union-find algorithm).

Example:

So, main = 22222 . (I won't use 1 as groups to reduce possible confusion, as curr uses bitstrings).

curr = 10000
main = 42222 // first bit (=2) += max (=2)

curr = 11100
main = 86622 // first 3 bits (=422) += max (=4)

curr = 11111
main = 16-14-14-10-10

curr = 01110
main = 16-30-30-26-10

curr = 00011
main = 16-30-30-56-40

Then split by distinct numbers:

{10000, 01100, 00010, 00001}

Improvement:

To reduce the speed at which main increases, we can replace

main[i] += max of main from previous iteration

with

main[i] += 1 + (max - min) of main from previous iteration

EDIT: Edit based on j_random_hacker's comment

You could combine the passes of the stupid algorithm at the cost of space.

Make a bit vector called violations that is (k - 1) k / 2 bits long (so, 496 for k = 32 .) Take a single pass over character sets. For each, and for each pair of letters, look for violations (ie XOR the bits for those letters, OR the result into the corresponding position in violations .) When you're done, negate and read off what's left.

You could give Principal Component Analysis a try. There are some flavors of PCA designed for binary or more generally for categorical data.

链接地址: http://www.djcxy.com/p/11024.html

上一篇: 关于Jenkins JUnit XML格式

下一篇: 用于查找一组位串的基础的算法？