Mathematica fast 2D binning algorithm
I am having some trouble developing a suitably fast binning algorithm in Mathematica. I have a large (~100k elements) data set of the form T={{x1,y1,z1},{x2,y2,z2},....} and I want to bin it into a 2D array of around 100x100 bins, with the bin value being given by the sum of the Z values that fall into each bin.
Currently I am iterating through each element of the table, using Select to pick out which bin it is supposed to be in based on lists of bin boundaries, and adding the z value to a list of values occupying that bin. At the end I map Total onto the list of bins, summing their contents (I do this because I sometimes want to do other things, like maximize).
I have tried using Gather and other such functions to do this but the above method was ridiculously faster, though perhaps I am using Gather poorly. Anyway It still takes a few minutes to do the sorting by my method and I feel like Mathematica can do better. Does anyone have a nice efficient algorithm handy?
Here is a method based on Szabolcs's post that is about about an order of magnitude faster.
data = RandomReal[5, {500000, 3}];
(*500k values*)
zvalues = data[[All, 3]];
epsilon = 1*^-10;(*prevent 101 index*)
(*rescale and round (x,y) coordinates to index pairs in the 1..100 range*)
indexes = 1 + Floor[(1 - epsilon) 100 Rescale[data[[All, {1, 2}]]]];
res2 = Module[{gb = GatherBy[Transpose[{indexes, zvalues}], First]},
SparseArray[
gb[[All, 1, 1]] ->
Total[gb[[All, All, 2]], {2}]]]; // AbsoluteTiming
Gives about {2.012217, Null}
AbsoluteTiming[
System`SetSystemOptions[
"SparseArrayOptions" -> {"TreatRepeatedEntries" -> 1}];
res3 = SparseArray[indexes -> zvalues];
System`SetSystemOptions[
"SparseArrayOptions" -> {"TreatRepeatedEntries" -> 0}];
]
Gives about {0.195228, Null}
res3 == res2
True
"TreatRepeatedEntries" -> 1 adds duplicate positions up.
I intend to do a rewrite of the code below because of Szabolcs' readability concerns. Until then, know that if your bins are regular, and you can use Round
, Floor
, or Ceiling
(with a second argument) in place of Nearest
, the code below will be much faster. On my system, it tests faster than the GatherBy
solution also posted.
Assuming I understand your requirements, I propose:
data = RandomReal[100, {75, 3}];
bins = {0, 20, 40, 60, 80, 100};
Reap[
Sow[{#3, #2}, bins ~Nearest~ #] & @@@ data,
bins,
Reap[Sow[#, bins ~Nearest~ #2] & @@@ #2, bins, Tr@#2 &][[2]] &
][[2]] ~Flatten~ 1 ~Total~ {3} // MatrixForm
Refactored:
f[bins_] := Reap[Sow[{##2}, bins ~Nearest~ #]& @@@ #, bins, #2][[2]] &
bin2D[data_, X_, Y_] := f[X][data, f[Y][#2, #2~Total~2 &] &] ~Flatten~ 1 ~Total~ {3}
Use:
bin2D[data, xbins, ybins]
Here's my approach:
data = RandomReal[5, {500000, 3}]; (* 500k values *)
zvalues = data[[All, 3]];
epsilon = 1*^-10; (* prevent 101 index *)
(* rescale and round (x,y) coordinates to index pairs in the 1..100 range *)
indexes = 1 + Floor[(1 - epsilon) 100 Rescale[data[[All, {1, 2}]]]];
(* approach 1: create bin-matrix first, then fill up elements by adding zvalues *)
res1 = Module[
{result = ConstantArray[0, {100, 100}]},
Do[
AddTo[result[[##]], zvalues[[i]]] & @@ indexes[[i]],
{i, Length[indexes]}
];
result
]; // Timing
(* approach 2: gather zvalues by indexes, add them up, convert them to a matrix *)
res2 = Module[{gb = GatherBy[Transpose[{indexes, zvalues}], First]},
SparseArray[gb[[All, 1, 1]] -> (Total /@ gb[[All, All, 2]])]
]; // Timing
res1 == res2
These two approaches ( res1
& res2
) can handle 100k and 200k elements per second, respectively, on this machine. Is this sufficiently fast, or do you need to run this whole program in a loop?
上一篇: 在Mathematica中实现四叉树
下一篇: Mathematica快速二维分箱算法