Mathematica fast 2D binning algorithm

2018-06-12 10:01:45

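I am having some trouble developing a suitably fast binning algorithm in Mathematica. I have a large (~100k elements) data set of the form T = {{x1, y1, z1}, {x2, y2, z2}, ...} and I want to bin it into a 2D array of about 100×100 bins, with each bin's value given by the sum of the z values that fall into it.

Currently I iterate through each element of the table, use Select to pick out which bin it belongs to based on lists of bin boundaries, and add the z value to a list of values occupying that bin. At the end I map Total onto the list of bins to sum their contents (I do it this way because I sometimes want a different aggregate, such as the maximum).

I have tried using Gather and similar functions for this, but the method above was ridiculously faster, though perhaps I am using Gather poorly. In any case, it still takes a few minutes to do the sorting with my method, and I feel Mathematica can do better. Does anyone have a nice, efficient algorithm handy?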

Here is a method based on Szabolcs's post that is roughly an order of magnitude faster.

data = RandomReal[5, {500000, 3}];
(*500k values*)
zvalues = data[[All, 3]];

epsilon = 1*^-10;(*prevent 101 index*)
(*rescale and round (x,y) coordinates to index pairs in the 1..100 range*)
indexes = 1 + Floor[(1 - epsilon) 100 Rescale[data[[All, {1, 2}]]]];

res2 = Module[{gb = GatherBy[Transpose[{indexes, zvalues}], First]}, 
    SparseArray[
     gb[[All, 1, 1]] -> 
      Total[gb[[All, All, 2]], {2}]]]; // AbsoluteTiming

gives about {2.012217, Null}
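A note on the index computation above: as far as I know, Rescale applied to the whole {x, y} matrix rescales with the global minimum and maximum of all entries, not per column. That is fine here because x and y share the same range. For data whose x and y ranges differ, a per-axis helper along the following lines may be safer. This is only a sketch: binIndex and idxPairs are hypothetical names that do not appear in the answer, and Clip is used in place of the epsilon trick.

```mathematica
(* Hypothetical helper: map values v in [min, max] to bin indices 1..n. *)
(* Clip keeps v == max in bin n, so no epsilon is needed. *)
binIndex[v_, {min_, max_}, n_] :=
  Clip[1 + Floor[n (v - min)/(max - min)], {1, n}]

xs = {0., 2.5, 5.};
ys = {10., 15., 30.};

(* One {ix, iy} pair per point, each axis binned over its own range *)
idxPairs = Transpose[{binIndex[xs, {0, 5}, 100], binIndex[ys, {10, 30}, 100]}]
(* {{1, 1}, {51, 26}, {100, 100}} *)
```

The resulting pairs can be used in place of indexes in the code above.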

AbsoluteTiming[
 System`SetSystemOptions[ 
  "SparseArrayOptions" -> {"TreatRepeatedEntries" -> 1}];
 res3 = SparseArray[indexes -> zvalues];
 System`SetSystemOptions[ 
  "SparseArrayOptions" -> {"TreatRepeatedEntries" -> 0}];
 ]

gives {0.195228, Null}

res3 == res2
True

"TreatRepeatedEntries" -> 1 makes SparseArray add up the values given for repeated positions.
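To see the effect on a tiny input, and to avoid leaving a global setting changed, the option switch can be wrapped in a function that saves and restores the previous settings. This is a sketch under assumptions: sumIntoBins is a hypothetical name, "TreatRepeatedEntries" is an undocumented system option, and the values it accepts may differ between Mathematica versions.

```mathematica
(* Hypothetical wrapper: sum values that land on the same index pair. *)
sumIntoBins[idx_, vals_, dims_] := Module[{old, result},
  old = SystemOptions["SparseArrayOptions"];   (* remember current settings *)
  SetSystemOptions[
    "SparseArrayOptions" -> {"TreatRepeatedEntries" -> 1}];
  result = SparseArray[idx -> vals, dims];     (* repeated positions are added *)
  SetSystemOptions[old];                       (* restore previous settings *)
  result]

(* Position {1, 1} appears twice, so its values 1. and 2. should be summed *)
small = sumIntoBins[{{1, 1}, {1, 1}, {2, 2}}, {1., 2., 5.}, {2, 2}];
Normal[small] == {{3., 0.}, {0., 5.}}
```

If the option behaves as described above, the final comparison should evaluate to True.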

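I intend to rewrite the code below because of Szabolcs's concerns about its readability. Until then, note that if your bins are regular and you can use Round, Floor, or Ceiling (with a second argument) in place of Nearest, the code below will be much faster. On my system it tests faster than the GatherBy solution that was also posted.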
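As a small illustration of that remark, for bins at multiples of 20 the two-argument form of Round picks the same bin as Nearest for the sample values below. This is a sketch with made-up values; xs, viaNearest, and viaRound are illustrative names, and values lying exactly halfway between two bins may be treated differently by the two functions.

```mathematica
bins = {0, 20, 40, 60, 80, 100};
xs = {3.2, 37., 58.9, 99.};

(* Nearest returns a list of closest bins; take the first one *)
viaNearest = First[Nearest[bins, #]] & /@ xs;

(* Round with a second argument rounds to the nearest multiple of 20 *)
viaRound = Round[xs, 20];

viaNearest == viaRound   (* True for these values *)

(* For regular bins the bin index follows directly *)
1 + viaRound/20          (* {1, 3, 4, 6} *)
```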

Assuming I understand your requirements, I propose:

data = RandomReal[100, {75, 3}];

bins = {0, 20, 40, 60, 80, 100};

Reap[
  Sow[{#3, #2}, bins ~Nearest~ #] & @@@ data,
  bins,
  Reap[Sow[#, bins ~Nearest~ #2] & @@@ #2, bins, Tr@#2 &][[2]] &
][[2]] ~Flatten~ 1 ~Total~ {3} // MatrixForm
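The code above leans on the three-argument form of Reap, where a list of tag patterns fixes the order of the output and a function is applied to each tag together with the values sown under it. A minimal sketch of that mechanism, using made-up string tags in place of bin values:

```mathematica
(* Sow each value under its tag, then collect per tag in the order given. *)
(* The function receives (tag, list of values); Total[#2] & sums the values. *)
collected = Reap[
    Sow[#2, #1] & @@@ {{"a", 1}, {"b", 2}, {"a", 3}},
    {"a", "b", "c"},
    Total[#2] &
  ][[2]]
(* {{4}, {2}, {}} *)
```

A tag with nothing sown under it ("c" here) yields an empty list, so empty bins still occupy their slot in the result.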

Refactored:

f[bins_] := Reap[Sow[{##2}, bins ~Nearest~ #]& @@@ #, bins, #2][[2]] &

bin2D[data_, X_, Y_] := f[X][data, f[Y][#2, #2~Total~2 &] &] ~Flatten~ 1 ~Total~ {3}

Usage:

bin2D[data, xbins, ybins]

Here is my approach:

data = RandomReal[5, {500000, 3}]; (* 500k values *)

zvalues = data[[All, 3]];

epsilon = 1*^-10; (* prevent 101 index *)

(* rescale and round (x,y) coordinates to index pairs in the 1..100 range *)    
indexes = 1 + Floor[(1 - epsilon) 100 Rescale[data[[All, {1, 2}]]]];

(* approach 1: create bin-matrix first, then fill up elements by adding  zvalues *)
res1 = Module[
    {result = ConstantArray[0, {100, 100}]},
    Do[
      AddTo[result[[##]], zvalues[[i]]] & @@ indexes[[i]], 
      {i, Length[indexes]}
    ];
    result
    ]; // Timing

(* approach 2: gather zvalues by indexes, add them up, convert them to a matrix *)
res2 = Module[{gb = GatherBy[Transpose[{indexes, zvalues}], First]},
    SparseArray[gb[[All, 1, 1]] -> (Total /@ gb[[All, All, 2]])]
    ]; // Timing

res1 == res2

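On this machine, the two approaches (res1 and res2) process roughly 100k and 200k elements per second, respectively. Is that fast enough, or do you need to run the whole thing in a loop?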
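The question mentions sometimes wanting an aggregate other than a sum, such as the maximum. One way to generalize the gather-then-aggregate idea of approach 2 is sketched below. It assumes Mathematica 10 or later (for GroupBy), and binBy is a hypothetical helper name, not part of the answer.

```mathematica
(* Hypothetical helper: group z values by index pair, reduce each group with f *)
binBy[f_, idx_, z_, dims_] :=
  SparseArray[Normal[GroupBy[Thread[idx -> z], First -> Last, f]], dims]

(* Tiny example: five points falling into a 2 x 2 grid *)
idx = {{1, 1}, {1, 2}, {1, 1}, {2, 2}, {2, 1}};
z = {0.5, 1., 2., 4., 3.};

Normal[binBy[Total, idx, z, {2, 2}]]   (* {{2.5, 1.}, {3., 4.}} *)
Normal[binBy[Max, idx, z, {2, 2}]]     (* {{2., 1.}, {3., 4.}} *)
```

Bins that receive no points keep the SparseArray default value of zero, which may or may not be the desired result for aggregates such as Max.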
