Set Cover or Hitting Set; Numpy, least element combinations to make up full set

My goal is to find the smallest possible number of subsets [a-f] that make up the full set A.
A = set([1,2,3,4,5,6,7,8,9,10]) # full set
#--- below are sub sets of A ---
a = set([1,2])
b = set([1,2,3])
c = set([1,2,3,4])
d = set([4,5,6,7])
e = set([7,8,9])
f = set([5,8,9,10])
In practice, the parent set A I'm dealing with contains 15k unique elements, with 30k subsets, and those subsets range in size from a single unique element up to 1.5k unique elements.
The code I'm using so far looks more or less like the following, and is painfully slow:
import random

B = {'a': a, 'b': b, 'c': c, 'd': d, 'e': e, 'f': f}
Bx = B.keys()
random.shuffle(Bx)

Dict = {}

for i in Bx: # iterate through shuffled keys.
    z = [i]
    x = B[i]
    L = len(x)
    while L < len(A):
        for ii in Bx:
            x = x | B[ii]
            Lx = len(x)
            if Lx > L:
                L = Lx
                z.append(ii)
    try:
        Dict[len(z)].append(z)
    except KeyError:
        Dict[len(z)] = [z]

print Dict[min(Dict.keys())]
This is just to illustrate the approach I've taken. For clarity I've left out some logic that minimizes iterating over sets that are already too large, and other things like that.
I imagine Numpy would be really good at this type of problem, but I can't think of a way to use it.
The question is asking for an implementation of the Set Cover Problem, for which no fast algorithm is known to find the optimal solution. However, the greedy solution to the problem (repeatedly choose the subset that contains the most elements that have not yet been covered) does a good job in reasonable time.
You can find an implementation of that algorithm in Python in a previous question.
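As a concrete sketch of that greedy strategy on the small example from the question (the sets a–f are redefined here so the snippet is self-contained and runs on Python 3):

```python
A = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
subsets = [{1, 2}, {1, 2, 3}, {1, 2, 3, 4},
           {4, 5, 6, 7}, {7, 8, 9}, {5, 8, 9, 10}]

cover = []
covered = set()
while covered < A:
    # greedy step: pick the subset that adds the most uncovered elements
    best = max(subsets, key=lambda s: len(s - covered))
    cover.append(best)
    covered |= best

print(cover)  # picks {1, 2, 3, 4}, then {5, 8, 9, 10}, then {4, 5, 6, 7}
```

Three subsets instead of trying every combination; the rest of this answer is about making that `max` step fast when there are 30k subsets.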
Edited to add: @Aaron Hall's answer can be improved by using the following in place of his greedy_set_cover routine. In Aaron's code, each time we want to add a subset to the cover, we calculate a score len(s - result_set) for every remaining subset. However, this score can only decrease as we add to result_set; so if in the current iteration we have picked out a best set whose score is higher than the scores the remaining subsets achieved in previous iterations, we know their scores cannot improve and we can ignore them. This suggests using a priority queue to store the subsets for processing; in Python we can implement the idea with heapq:
# at top of file
import heapq
#... etc
# replace greedy_set_cover
@timer
def greedy_set_cover(subsets, parent_set):
    parent_set = set(parent_set)
    max = len(parent_set)
    # create the initial heap. Note 'subsets' can be unsorted,
    # so this is independent of whether remove_redundant_subsets is used.
    heap = []
    for s in subsets:
        # Python's heapq lets you pop the *smallest* value, so we
        # want to use max-len(s) as a score, not len(s).
        # len(heap) is just providing a unique number to each subset,
        # used to tiebreak equal scores.
        heapq.heappush(heap, [max-len(s), len(heap), s])
    results = []
    result_set = set()
    while result_set < parent_set:
        logging.debug('len of result_set is {0}'.format(len(result_set)))
        best = []
        unused = []
        while heap:
            score, count, s = heapq.heappop(heap)
            if not best:
                best = [max-len(s - result_set), count, s]
                continue
            if score >= best[0]:
                # because subset scores only get worse as the resultset
                # gets bigger, we know that the rest of the heap cannot beat
                # the best score. So push the subset back on the heap, and
                # stop this iteration.
                heapq.heappush(heap, [score, count, s])
                break
            score = max-len(s - result_set)
            if score >= best[0]:
                unused.append([score, count, s])
            else:
                unused.append(best)
                best = [score, count, s]
        add_set = best[2]
        logging.debug('len of add_set is {0} score was {1}'.format(len(add_set), best[0]))
        results.append(add_set)
        result_set.update(add_set)
        # subsets that were not the best get put back on the heap for next time.
        while unused:
            heapq.heappush(heap, unused.pop())
    return results
For comparison, here are the timings for Aaron's code on my laptop. I dropped remove_redundant_subsets since, once we use the heap, dominated subsets are never reprocessed anyway:
INFO:root:make_subsets function took 15800.697 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 463.478 ms
INFO:root:greedy_set_cover function took 32662.359 ms
INFO:root:len of results is 46
And here are the timings for the code above; a bit over 3 times faster.
INFO:root:make_subsets function took 15674.409 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 461.027 ms
INFO:root:greedy_pq_set_cover function took 8896.885 ms
INFO:root:len of results is 46
Note: the two algorithms process the subsets in different orders and will occasionally give different answers for the size of the set cover; this just comes down to "lucky" subset choices when scores are tied.
The priority queue/heap is a well-known optimization of the greedy algorithm, though I couldn't find a decent discussion of it to link to.
While the greedy algorithm is a quick way to get an approximate answer, you can spend time afterwards improving the answer, knowing that we now have an upper bound on the minimal set cover. Techniques for doing so include simulated annealing, or branch-and-bound algorithms as illustrated by Peter Norvig in this article.
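One cheap illustration of such post-processing (my own sketch, not from the linked article; `prune_cover` is a hypothetical helper name): after greedy finishes, check whether any chosen subset is entirely covered by the others and can be dropped:

```python
def prune_cover(cover, parent_set):
    """Drop any subset whose elements are all covered by the rest."""
    cover = list(cover)
    for s in list(cover):
        # union of every chosen subset except s (identity test: distinct objects)
        rest = set().union(*(t for t in cover if t is not s))
        if rest >= parent_set:
            cover.remove(s)
    return cover

# e.g. {1, 2} is redundant once {1, 2, 3} and {3, 4} are both present
print(prune_cover([{1, 2}, {1, 2, 3}, {3, 4}], {1, 2, 3, 4}))
# prints [{1, 2, 3}, {3, 4}]
```

This can only shave off strictly redundant picks; simulated annealing and branch-and-bound can do much better at the cost of more time.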
Thanks for the question; I found it fascinating. I have tested the code below on Python 2.6, 2.7, and 3.3. You may find it interesting to run it yourself; it's easy to paste into an interpreter or run as a script.
Another answer here attempts to solve this by brute force, i.e. going through every possible combination, which may work for ten elements, as in the asker's example, but won't provide a solution for the parameters the asker requested: choosing a combination of subsets (up to 1500 elements each, from a superset of 15000 elements) from a pool of 30,000 sets. For those parameters, trying to find a solution set where n = 40 (very unlikely) would mean searching through on the order of more than a googol of combinations, which is quite impossible.
Setup
Here I import some modules used to benchmark my functions and create the data. I also create a timer decorator to wrap the functions so I can easily measure how much time passes before a function completes (or I give up and interrupt it).
import functools
import time
import logging
import random

# basic setup:
logging.basicConfig(level=logging.DEBUG) # INFO or DEBUG
random.seed(1)
PARENT_SIZE = 15000
MAX_SUBSET_SIZE = 1500
N_SUBSETS = 30000

def timer(f):
    '''
    timer wrapper modified, original obtained from:
    http://stackoverflow.com/questions/5478351/python-time-measure-function
    '''
    @functools.wraps(f)
    def wrap(*args):
        time1 = time.time()
        try:
            ret = f(*args)
        except KeyboardInterrupt:
            time2 = time.time()
            logging.info('{0} function interrupted after {1:.3f} ms'.format(
                f.__name__, (time2-time1)*1000.0))
        else:
            time2 = time.time()
            logging.info('{0} function took {1:.3f} ms'.format(
                f.__name__, (time2-time1)*1000.0))
            return ret
    return wrap
Data creation functions
Next I have to create the data:
@timer
def make_subsets(parent_set, n):
    '''create list of subset sets, takes about 17 secs'''
    subsets = []
    for i in range(n): # use xrange in python 2
        subsets.append(set(random.sample(parent_set,
                           random.randint(1, MAX_SUBSET_SIZE))))
    return subsets

@timer
def include_complement(parent_set, subsets):
    '''ensure no missing elements from parent, since collected randomly'''
    union_subsets = set().union(*subsets)
    subsets_complement = set(parent_set) - union_subsets
    logging.info('len of union of all subsets was {0}'.format(
        len(union_subsets)))
    if subsets_complement:
        logging.info('len of subsets_complement was {0}'.format(
            len(subsets_complement)))
        subsets.append(subsets_complement)
    return subsets
Optional pre-processing
I provide some preprocessing which runs in a couple of seconds, but it doesn't help much, speeding things up by only a fraction of a second; it's recorded here for the reader's edification:
@timer
def remove_redundant_subsets(subsets):
    '''
    without break, takes a while, removes 81 sets of len <= 4 (seed(0))
    in 5.5 minutes, so breaking at len 10 for 4 second completion.
    probably unnecessary if truly random subsets
    but *may* be good if large subsets are subsets of others.
    '''
    subsets.sort(key=len)
    remove_list = []
    for index, s in enumerate(subsets, 1):
        if len(s) > 10: # possible gain not worth continuing farther
            break
        if any(s.issubset(other) for other in subsets[index:]):
            logging.debug('will remove subset: {s}'.format(s=s))
            remove_list.append(s)
    logging.info('subsets removing: {0}'.format(len(remove_list)))
    for s in remove_list:
        subsets.remove(s)
    return subsets
The actual function
Then I actually perform the greedy algorithm:
@timer
def greedy_set_cover(subsets, parent_set):
    parent_set = set(parent_set)
    results = []
    result_set = set()
    while result_set < parent_set:
        logging.debug('len of result_set is {0}'.format(len(result_set)))
        # maybe room for optimization here: Will still have to calculate.
        # But custom max could shortcut subsets on uncovered more than len.
        add_set = max(subsets, key=lambda x: len(x - result_set))
        logging.debug('len of add_set is {0}'.format(len(add_set)))
        results.append(add_set)
        result_set.update(add_set)
    return results
Here's the main():
# full set, use xrange instead of range in python 2 for space efficiency
parent_set = range(PARENT_SIZE)
subsets = make_subsets(parent_set, N_SUBSETS)
logging.debug(len(subsets))
subsets = include_complement(parent_set, subsets) # if necessary
logging.debug(len(subsets))
subsets = remove_redundant_subsets(subsets)
logging.debug(len(subsets))
results = greedy_set_cover(subsets, parent_set)
logging.info('len of results is {0}'.format(len(results)))
for i, subset in enumerate(results, 1): # renamed from 'set' to avoid shadowing the builtin
    logging.debug('len of set {0} is {1}'.format(i, len(subset)))
Final results
This gives a final result of 46(ish) subsets in a little over 3 minutes, given the original parameters the asker stated, run in Python 2.
Here's the output for seed(0):
INFO:root:make_subsets function took 17158.725 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 2716.381 ms
INFO:root:subsets removing: 81
INFO:root:remove_redundant_subsets function took 3319.620 ms
INFO:root:greedy_set_cover function took 188026.052 ms
INFO:root:len of results is 46
And here's the output for seed(1):
INFO:root:make_subsets function took 17538.083 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 2414.091 ms
INFO:root:subsets removing: 68
INFO:root:remove_redundant_subsets function took 3218.643 ms
INFO:root:greedy_set_cover function took 189019.275 ms
INFO:root:len of results is 47
This was quite fun, thank you for the problem.
PS: I decided to try benchmarking the naive brute-force approach as well:
INFO:root:make_subsets function took 17984.412 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 2412.666 ms
INFO:root:foo function interrupted after 3269064.913 ms
Naturally, I interrupted it, since it would never get close to finishing in my lifetime, perhaps not even in our sun's lifetime?:
>>> import math
>>> def combinations(n, k):
... return math.factorial(n)/(math.factorial(k)*math.factorial(n-k))
...
>>> combinations(30000, 40)
145180572634248196249221943251413238587572515214068555166193044430231638603286783165583063264869418810588422212955938270891601399250L
Here's a solution using itertools.combinations to iterate through the various combinations of subsets, and union(*x) to combine each of them.
import itertools

subsets = [a, b, c, d, e, f]

def foo(A, subsets):
    found = []
    for n in range(2, len(subsets)):
        for x in itertools.combinations(subsets, n):
            u = set().union(*x)
            if A == u:
                found.append(x)
        if found:
            break
    return found

print foo(A, subsets)
which produces:
[(set([1, 2, 3]), set([4, 5, 6, 7]), set([8, 9, 10, 5])),
(set([1, 2, 3, 4]), set([4, 5, 6, 7]), set([8, 9, 10, 5]))]
For this example it runs a bit faster than your code, though if I expand it to keep track of the subset names it runs a bit slower. But this is only a small example, so the timings don't mean much. (Edit: as demonstrated in the other answers, this approach slows down considerably on larger problems.)
numpy isn't much help here, since we aren't dealing with array or parallel operations. As others have written, it's basically a search problem. You can speed up the inner steps and try to prune dead ends, but you can't escape trying out many alternatives.
The usual way to do searches in numpy is to construct a matrix of all combinations, and then pull out the desired ones with something like sum, min, or max. It's a brute-force approach that takes advantage of fast compiled operations on arrays.
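That said, the greedy step itself (rather than full brute force) does vectorize nicely. Here's a sketch, under the assumption that each subset is encoded as a row of a boolean membership matrix; this encoding is my own illustration, not code from the question:

```python
import numpy as np

A = list(range(1, 11))                      # full set 1..10
subsets = [{1, 2}, {1, 2, 3}, {1, 2, 3, 4},
           {4, 5, 6, 7}, {7, 8, 9}, {5, 8, 9, 10}]

# Boolean membership matrix: M[i, j] is True iff element A[j] is in subsets[i].
M = np.array([[x in s for x in A] for s in subsets])

uncovered = np.ones(len(A), dtype=bool)     # mask of elements not yet covered
chosen = []                                 # indices of picked subsets
while uncovered.any():
    # score every subset at once: how many uncovered elements each would add
    gains = (M & uncovered).sum(axis=1)
    best = int(np.argmax(gains))
    chosen.append(best)
    uncovered &= ~M[best]                   # mark the new elements as covered

print(chosen)  # → [2, 5, 3], i.e. {1,2,3,4}, {5,8,9,10}, {4,5,6,7}
```

For 30k subsets over 15k elements the matrix is about 450M booleans, so memory may become the limiting factor before speed does.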