Set Cover or Hitting Set; Numpy, Least element combinations to make up full set
My goal is to find the smallest possible number of the subsets [a-f] below whose union makes up the full set A.
A = set([1,2,3,4,5,6,7,8,9,10]) # full set
#--- below are sub sets of A ---
a = set([1,2])
b = set([1,2,3])
c = set([1,2,3,4])
d = set([4,5,6,7])
e = set([7,8,9])
f = set([5,8,9,10])
In reality the parent set A I'm dealing with contains 15k unique elements, there are 30k subsets, and those subsets range in length from a single unique element to 1.5k unique elements.
As of now the code I'm using looks more or less like the following and is PAINFULLY slow:
import random
B = {'a': a, 'b': b, 'c': c, 'd': d, 'e': e, 'f': f}
Bx = B.keys()
random.shuffle(Bx)
Dict = {}
for i in Bx: # iterate through shuffled keys.
    z = [i]
    x = B[i]
    L = len(x)
    while L < len(A):
        for ii in Bx:
            x = x | B[ii]
            Lx = len(x)
            if Lx > L:
                L = Lx
                z.append(ii)
    try:
        Dict[len(z)].append(z)
    except KeyError:
        Dict[len(z)] = [z]
print Dict[min(Dict.keys())]
This is just to give an idea of the approach I've taken. For clarity I've left out some logic that minimizes iterations on sets that are already too large and other things like that.
I imagine Numpy would be REALLY good at this type of problem but I can't think of a way to use it.
The question is asking for a solution to the Set Cover problem, which is NP-hard, so no efficient algorithm is known for finding the optimal solution. However, the greedy heuristic (repeatedly choose the subset that contains the most elements not yet covered) does a good job in reasonable time.
You can find a Python implementation of that algorithm at this previous question.
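For a quick feel of its shape, here is a minimal sketch of that greedy heuristic applied to the example sets above (the function name and the dict argument are my own illustration, not taken from the linked answer):
def greedy_cover(full_set, named_subsets):
    '''Repeatedly pick the subset covering the most still-uncovered elements.'''
    uncovered = set(full_set)
    chosen = []
    while uncovered:
        # name of the subset that covers the most still-uncovered elements
        best = max(named_subsets, key=lambda k: len(named_subsets[k] & uncovered))
        if not named_subsets[best] & uncovered:
            break  # nothing left can cover the remaining elements
        chosen.append(best)
        uncovered -= named_subsets[best]
    return chosen

# e.g. greedy_cover(A, {'a': a, 'b': b, 'c': c, 'd': d, 'e': e, 'f': f})
# returns three subset names for the example, e.g. ['c', 'f', 'd'];
# the exact picks can vary when scores are tied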
Edited to add: @Aaron Hall's answer can be improved by using the drop-in replacement below for his greedy_set_cover routine. In Aaron's code we calculate a score len(s - result_set) for every remaining subset each time we want to add a subset to the cover. However, this score can only decrease as result_set grows; so if on the current iteration we have picked out a best-so-far set with a score higher than any score the remaining subsets achieved in previous iterations, we know those subsets cannot beat it and can simply skip them. That suggests storing the subsets in a priority queue; in Python we can implement the idea with heapq:
# at top of file
import heapq
#... etc
# replace greedy_set_cover
@timer
def greedy_set_cover(subsets, parent_set):
    parent_set = set(parent_set)
    max = len(parent_set)  # note: shadows the built-in max(), which is not used below
    # create the initial heap. Note 'subsets' can be unsorted,
    # so this is independent of whether remove_redundant_subsets is used.
    heap = []
    for s in subsets:
        # Python's heapq lets you pop the *smallest* value, so we
        # want to use max-len(s) as a score, not len(s).
        # len(heap) is just providing a unique number for each subset,
        # used to tiebreak equal scores.
        heapq.heappush(heap, [max-len(s), len(heap), s])
    results = []
    result_set = set()
    while result_set < parent_set:
        logging.debug('len of result_set is {0}'.format(len(result_set)))
        best = []
        unused = []
        while heap:
            score, count, s = heapq.heappop(heap)
            if not best:
                best = [max-len(s - result_set), count, s]
                continue
            if score >= best[0]:
                # because subset scores only get worse as the result_set
                # gets bigger, we know that the rest of the heap cannot beat
                # the best score. So push the subset back on the heap, and
                # stop this iteration.
                heapq.heappush(heap, [score, count, s])
                break
            score = max-len(s - result_set)
            if score >= best[0]:
                unused.append([score, count, s])
            else:
                unused.append(best)
                best = [score, count, s]
        add_set = best[2]
        logging.debug('len of add_set is {0}, score was {1}'.format(
            len(add_set), best[0]))
        results.append(add_set)
        result_set.update(add_set)
        # subsets that were not the best get put back on the heap for next time.
        while unused:
            heapq.heappush(heap, unused.pop())
    return results
For comparison, here are the timings for Aaron's code on my laptop. I dropped remove_redundant_subsets, since when we use the heap, dominated subsets are never reprocessed anyway:
INFO:root:make_subsets function took 15800.697 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 463.478 ms
INFO:root:greedy_set_cover function took 32662.359 ms
INFO:root:len of results is 46
And here's the timing with the code above; a bit over 3 times faster.
INFO:root:make_subsets function took 15674.409 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 461.027 ms
INFO:root:greedy_pq_set_cover function took 8896.885 ms
INFO:root:len of results is 46
NB: these two algorithms process the subsets in different orders and will occasionally give different answers for the size of the set cover; this is just down to 'lucky' choices of subsets when scores are tied.
The priority queue/heap is a well-known optimisation of the greedy algorithm, though I couldn't find a decent discussion of it to link to.
While the greedy algorithm is a quick-ish way to get an approximate answer, you can improve on that answer by spending time afterwards, knowing that it gives an upper bound on the minimal set cover. Techniques for doing so include simulated annealing and branch-and-bound algorithms, as illustrated in this article by Peter Norvig.
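As a much simpler illustration of that "spend time afterwards" idea (far more basic than annealing or branch-and-bound, and not part of the timed code above), a cheap post-processing pass can drop any chosen subset that the rest of the cover already makes redundant:
def prune_cover(cover, parent_set):
    '''Drop chosen subsets that the remaining chosen subsets already cover.'''
    parent_set = set(parent_set)
    pruned = list(cover)
    for s in sorted(cover, key=len):  # try to discard the small sets first
        rest = [t for t in pruned if t is not s]
        if rest and set().union(*rest) >= parent_set:
            pruned = rest
    return pruned

Greedy covers often contain no such redundancy, so this only occasionally shaves off a set, but it costs almost nothing compared to the greedy pass itself.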
Thanks for the question; I found it very interesting. I have tested the code below on Python 2.6, 2.7, and 3.3. You may find it interesting to run it yourself, as I made it easy to paste into an interpreter or to run as a script.
Another answer here attempts to solve by brute force, i.e. going through every possible combination. That may be doable for the ten elements the questioner gave as an example, but it won't offer a solution for the parameters the questioner actually asked for: choosing a combination of subsets (up to 1,500 elements long, drawn from a superset of 15,000 elements) from a set of 30,000 subsets. For those parameters, I found that even looking for a solution set of size n = 40 (very unlikely to be enough) would mean searching a number of combinations many orders of magnitude beyond a googol, which is quite impossible.
The Setup
Here I import some modules used to benchmark my functions and create the data. I also create a timer decorator to wrap the functions so I can easily measure how much time passes before the function completes (or I give up and interrupt the function).
import functools
import time
import logging
import random
# basic setup:
logging.basicConfig(level=logging.DEBUG) # INFO or DEBUG
random.seed(1)
PARENT_SIZE = 15000
MAX_SUBSET_SIZE = 1500
N_SUBSETS = 30000
def timer(f):
    '''
    timer wrapper modified, original obtained from:
    http://stackoverflow.com/questions/5478351/python-time-measure-function
    '''
    @functools.wraps(f)
    def wrap(*args):
        time1 = time.time()
        try:
            ret = f(*args)
        except KeyboardInterrupt:
            time2 = time.time()
            logging.info('{0} function interrupted after {1:.3f} ms'.format(
                f.__name__, (time2-time1)*1000.0))
        else:
            time2 = time.time()
            logging.info('{0} function took {1:.3f} ms'.format(
                f.__name__, (time2-time1)*1000.0))
            return ret
    return wrap
Data Creation Functions
Next I have to create the data:
@timer
def make_subsets(parent_set, n):
    '''create list of subset sets, takes about 17 secs'''
    subsets = []
    for i in range(n): # use xrange in python 2
        subsets.append(set(random.sample(parent_set, random.randint(1, MAX_SUBSET_SIZE))))
    return subsets
@timer
def include_complement(parent_set, subsets):
    '''ensure no missing elements from parent, since collected randomly'''
    union_subsets = set().union(*subsets)
    subsets_complement = set(parent_set) - union_subsets
    logging.info('len of union of all subsets was {0}'.format(
        len(union_subsets)))
    if subsets_complement:
        logging.info('len of subsets_complement was {0}'.format(
            len(subsets_complement)))
        subsets.append(subsets_complement)
    return subsets
Optional Preprocessing
I provide some preprocessing; it runs in a few seconds but doesn't help much, speeding things up by only a fraction of a second. It's recorded here for the reader's edification:
@timer
def remove_redundant_subsets(subsets):
    '''
    without break, takes a while, removes 81 sets of len <= 4 (seed(0))
    in 5.5 minutes, so breaking at len 10 for 4 second completion.
    probably unnecessary if truly random subsets
    but *may* be good if large subsets are subsets of others.
    '''
    subsets.sort(key=len)
    remove_list = []
    for index, s in enumerate(subsets, 1):
        if len(s) > 10: # possible gain not worth continuing farther
            break
        if any(s.issubset(other) for other in subsets[index:]):
            logging.debug('will remove subset: {s}'.format(s=s))
            remove_list.append(s)
    logging.info('subsets removing: {0}'.format(len(remove_list)))
    for s in remove_list:
        subsets.remove(s)
    return subsets
Actual Function
Then I actually perform the Greedy Algorithm:
@timer
def greedy_set_cover(subsets, parent_set):
    parent_set = set(parent_set)
    results = []
    result_set = set()
    while result_set < parent_set:
        logging.debug('len of result_set is {0}'.format(len(result_set)))
        # maybe room for optimization here: we still have to calculate
        # len(x - result_set) for every subset, but a custom max could
        # skip subsets whose total length cannot beat the best so far.
        add_set = max(subsets, key=lambda x: len(x - result_set))
        logging.debug('len of add_set is {0}'.format(len(add_set)))
        results.append(add_set)
        result_set.update(add_set)
    return results
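The comment inside the loop hints at a shortcut for that max call: len(x - result_set) can never exceed len(x), so if the subsets are scanned largest-first the scan can stop as soon as the best uncovered count already found is at least the length of the next subset. A small sketch of that idea (my own illustration, assuming a one-off sort of the subsets by descending length before the loop):
def best_subset(subsets_by_size, result_set):
    '''subsets_by_size must be sorted by len, largest first.
    Stops scanning once no remaining subset could beat the best score.'''
    best, best_score = None, -1
    for s in subsets_by_size:
        if len(s) <= best_score:  # len(s) is an upper bound on its score
            break
        score = len(s - result_set)
        if score > best_score:
            best, best_score = s, score
    return best

The heap-based answer above takes the same observation further by also remembering stale scores between iterations.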
Here's the main script:
# full set, use xrange instead of range in python 2 for space efficiency
parent_set = range(PARENT_SIZE)
subsets = make_subsets(parent_set, N_SUBSETS)
logging.debug(len(subsets))
subsets = include_complement(parent_set, subsets) # if necessary
logging.debug(len(subsets))
subsets = remove_redundant_subsets(subsets)
logging.debug(len(subsets))
results = greedy_set_cover(subsets, parent_set)
logging.info('len of results is {0}'.format(len(results)))
for i, s in enumerate(results, 1):
    logging.debug('len of set {0} is {1}'.format(i, len(s)))
Final Results
And this provides a final result of 46(ish) subsets in just over 3 minutes, given the original parameters the questioner gave, run in Python 2.
Here's the output for seed(0):
INFO:root:make_subsets function took 17158.725 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 2716.381 ms
INFO:root:subsets removing: 81
INFO:root:remove_redundant_subsets function took 3319.620 ms
INFO:root:greedy_set_cover function took 188026.052 ms
INFO:root:len of results is 46
And here's the output for seed(1):
INFO:root:make_subsets function took 17538.083 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 2414.091 ms
INFO:root:subsets removing: 68
INFO:root:remove_redundant_subsets function took 3218.643 ms
INFO:root:greedy_set_cover function took 189019.275 ms
INFO:root:len of results is 47
This was a lot of fun, thanks for the problem.
PS: I decided to try benchmarking the naive brute-force approach:
INFO:root:make_subsets function took 17984.412 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 2412.666 ms
INFO:root:foo function interrupted after 3269064.913 ms
Naturally, I interrupted it, as it would never get close in my lifetime, or perhaps even in our Sun's lifetime:
>>> import math
>>> def combinations(n, k):
... return math.factorial(n)/(math.factorial(k)*math.factorial(n-k))
...
>>> combinations(30000, 40)
145180572634248196249221943251413238587572515214068555166193044430231638603286783165583063264869418810588422212955938270891601399250L
Here's a solution using itertools.combinations to iterate over various combinations of the subsets, and union(*x) to combine them.
import itertools
subsets = [a, b, c, d, e, f]

def foo(A, subsets):
    found = []
    for n in range(2, len(subsets)):
        for x in itertools.combinations(subsets, n):
            u = set().union(*x)
            if A == u:
                found.append(x)
        if found:
            break
    return found

print foo(A, subsets)
produces:
[(set([1, 2, 3]), set([4, 5, 6, 7]), set([8, 9, 10, 5])),
(set([1, 2, 3, 4]), set([4, 5, 6, 7]), set([8, 9, 10, 5]))]
For this example it runs a bit faster than your code, though if I expand it to keep track of the subset names it runs a bit slower. But this is a small example, so timings don't mean much. (edit - as demonstrated in the other answer, this approach slows down considerably with larger problems).
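For reference, here is roughly what that name-tracking expansion could look like (a sketch only; it iterates over (name, set) pairs instead of the bare sets):
import itertools
named = {'a': a, 'b': b, 'c': c, 'd': d, 'e': e, 'f': f}

def foo_named(A, named):
    found = []
    for n in range(2, len(named)):
        for combo in itertools.combinations(named.items(), n):
            names, sets_ = zip(*combo)
            if A == set().union(*sets_):
                found.append(names)
        if found:
            break
    return found

# foo_named(A, named) -> e.g. [('b', 'd', 'f'), ('c', 'd', 'f')]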
numpy isn't going to help here, since we are not dealing with array or parallel operations. As others have written, it is basically a search problem. You can speed up the internal steps and try to prune away dead ends, but you can't get away from trying many alternatives.
The usual way to do searches in numpy is to construct a matrix of all combinations, and then pull out the desired ones with something like sum, min or max. It's a brute-force approach that takes advantage of fast compiled operations on arrays.
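To make that concrete, here is one way such a matrix formulation might look for the small example above: a boolean membership matrix with one row per subset, where a candidate combination is tested by OR-ing its rows. This only illustrates the idea; for 30k subsets the number of combinations, not the cost of each check, remains the bottleneck.
import itertools
import numpy as np

elements = sorted(A)          # the example sets from the question
subsets = [a, b, c, d, e, f]

# boolean membership matrix: one row per subset, one column per element of A
M = np.array([[el in s for el in elements] for s in subsets])

# brute-force check of all 3-subset combinations using array operations
for idx in itertools.combinations(range(len(subsets)), 3):
    if M[list(idx)].any(axis=0).all():  # OR of the chosen rows covers every column
        print(idx)                      # prints (1, 3, 5) -> b,d,f and (2, 3, 5) -> c,d,f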