Python快速计算大量距离

2018-07-02 09:12:53

我有36,742点的输入，这意味着如果我想计算距离矩阵的下三角（使用vincenty近似），我需要产生36,742 * 36,741 * 0.5 = 1,349,974,563的距离。

我想保持彼此距离在50公里以内的配对组合。我目前的设置如下

shops= [[id,lat,lon]...]

def lower_triangle_mat(points):
    for i in range(len(shops)-1):
        for j in range(i+1,len(shops)):
            yield [shops[i],shops[j]]

def return_stores_cutoff(points,cutoff_km=0):
    below_cut = []
    counter = 0
    for x in lower_triangle_mat(points):
        dist_km = vincenty(x[0][1:3],x[1][1:3]).km
        counter += 1
        if counter % 1000000 == 0:
            print("%d out of %d" % (counter,(len(shops)*len(shops)-1*0.5)))
        if dist_km <= cutoff_km:
            below_cut.append([x[0][0],x[1][0],dist_km])
    return below_cut

start = time.clock()
stores = return_stores_cutoff(points=shops,cutoff_km=50)
print(time.clock() - start)

这显然需要几个小时。我想到的一些可能性：

使用numpy来引导这些计算而不是循环

使用某种散列来快速切断（100公里范围内的所有商店），然后只计算这些商店之间的准确距离

不是将点存储在列表中，而是使用类似于四叉树的东西，但我认为这只能帮助排名靠近点而不是实际距离 - >所以我猜想某种地理数据库

我明显可以尝试使用海星或项目并使用欧几里得距离，但是我有兴趣使用最准确的测量

利用并行处理（但是，如何减少列表以获得所有相关的对），我有点困难。

编辑：我认为geohashing肯定是需要在这里 - 一个例子来自：

from geoindex import GeoGridIndex, GeoPoint

geo_index = GeoGridIndex()
for _ in range(10000):
    lat = random.random()*180 - 90
    lng = random.random()*360 - 180
    index.add_point(GeoPoint(lat, lng))

center_point = GeoPoint(37.7772448, -122.3955118)
for distance, point in index.get_nearest_points(center_point, 10, 'km'):
    print("We found {0} in {1} km".format(point, distance))

但是，我还想引导（而不是循环）由地理散列返回的商店的距离计算。

编辑2：Pouria Hadjibagheri - 我尝试使用lambda和地图：

# [B]: Mapping approach           
lwr_tr_mat = ((shops[i],shops[j]) for i in range(len(shops)-1) for j in range(i+1,len(shops)))

func = lambda x: (x[0][0],x[1][0],vincenty(x[0],x[1]).km)
# Trying to see if conditional statements slow this down
func_cond = lambda x: (x[0][0],x[1][0],vincenty(x[0],x[1]).km) if vincenty(x[0],x[1]).km <= 50 else None

start = time.clock()
out_dist = list(map(func,lwr_tr_mat))
print(time.clock() - start)

start = time.clock()
out_dist = list(map(func_cond,lwr_tr_mat))
print(time.clock() - start)

他们都在61秒左右 （我限制了从32,000到2000的商店数量）。也许我错误地使用了地图？

这听起来像是kD树的经典用例。

如果你首先将你的点转换成欧几里德空间，那么你可以使用scipy.spatial.cKDTree的query_pairs方法：

from scipy.spatial import cKDTree

tree = cKDTree(data)
# where data is (nshops, ndim) containing the Euclidean coordinates of each shop
# in units of km

pairs = tree.query_pairs(50, p=2)   # 50km radius, L2 (Euclidean) norm

pairs将是一set (i, j)元组，对应于彼此≤50km的商店对的行索引。

tree.sparse_distance_matrix的输出是一个scipy.sparse.dok_matrix 。由于矩阵将是对称的，并且您只对唯一的行/列对感兴趣，因此可以使用scipy.sparse.tril将上三角形归零，从而为您提供scipy.sparse.coo_matrix 。从那里，您可以通过.row ， .col和.data属性访问非零的行和列索引及其相应的距离值：

from scipy import sparse

tree_dist = tree.sparse_distance_matrix(tree, max_distance=10000, p=2)
udist = sparse.tril(tree_dist, k=-1)    # zero the main diagonal
ridx = udist.row    # row indices
cidx = udist.col    # column indices
dist = udist.data   # distance values

你试过映射整个数组和函数而不是迭代它们吗？一个例子如下：

from numpy.random import rand

my_array = rand(int(5e7), 1)  # An array of 50,000,000 random numbers in double.

现在通常做的是：

squared_list_iter = [value**2 for value in my_array]

当然有效，但是最佳无效。

另一种方法是使用函数映射数组。这如下完成：

func = lambda x: x**2  # Here is what I want to do on my array.

squared_list_map = map(func, test)  # Here I am doing it!

现在，有人会问，这有什么不同，甚至更好？既然现在我们也添加了一个函数调用！这是你的答案：

对于以前的解决方案（通过迭代）：

1 loop: 1.11 minutes.

与后者解决方案（映射）相比：

500 loop, on average 560 ns.

同时将map()转换为list(map(my_list))将使时间增加10倍至约500 ms 。

你选！

“使用某种哈希来快速切断（100公里范围内的所有商店），然后只计算这些商店之间的准确距离。”我认为这可能更好地称为网格化。因此，先以一系列关键词作为字典，然后将每家商店放在距离该地点50公里的地方。那么当你计算距离时，你只能看看附近的桶，而不是遍历整个宇宙中的每个商店

链接地址: http://www.djcxy.com/p/90273.html

上一篇: Python calculate lots of distances quickly

下一篇: How to set 'use strict' globally with JSLint