doing better than numpy's in1d mask function: ordered arrays?

This operation needs to be applied as fast as possible, because the actual arrays contain millions of elements. This is a simplified version of the problem.

So, I have a random array of unique integers (normally millions of elements).

totalIDs = [5,4,3,1,2,9,7,6,8 ...]

I have other arrays (normally tens of thousands of elements each) of unique integers from which I can create a mask.

subsampleIDs1 = [5,1,9]
subsampleIDs2 = [3,7,8]
subsampleIDs3 = [2,6,9]
...

I can use numpy to do

mask = np.in1d(totalIDs,subsampleIDs,assume_unique=True)

I can then extract the information I want from another array using the mask (say column 0 contains the values I want).

variable = allvariables[mask][:,0]

Now, given that the IDs are unique in both arrays, is there any way to speed this up significantly? It takes a long time to construct the mask for a few thousand points (subsampleIDs) matched against millions of IDs (totalIDs).

I thought of going through it once and writing out a binary file of an index (to speed up future searches).

subsamples = [subsampleIDs1, subsampleIDs2, subsampleIDs3]
index = np.empty(len(totalIDs), dtype=int)
for i in range(0, 3):
    mask = np.in1d(totalIDs, subsamples[i], assume_unique=True)
    index[mask] = i

so that the rows belonging to each subsampleIDsX are labelled with the corresponding i. Then I can just do:

# find the first row whose index value equals i (assumes the rows for each
# subsample are stored contiguously)
for row in range(len(index)):
    if index[row] == i:
        rowmatch = row
        break

variable = allvariables[rowmatch:rowmatch + len(subsampleIDs), 0]

Right? But this is also slow, because there is a conditional in the loop to find where it first matches. Is there a faster way to find where a number first appears in an ordered array, so the conditional doesn't slow down the loop?
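
For the binary index file mentioned above, a minimal persistence sketch (my own illustration; the filename is made up) would be to save the index array once with np.save and reload it in later runs instead of recomputing the masks:

import numpy as np

# build the index once (as above), then persist it for future searches
np.save("id_index.npy", index)   # hypothetical filename

# in later runs, just reload it
index = np.load("id_index.npy")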


I suggest you use a DataFrame in pandas: make totalIDs the index of the DataFrame, and then you can select the subsample with df.loc[subsampleIDs] (older pandas versions used df.ix).

Create some test data first:

import numpy as np
N = 2000000
M = 5000

# millions of unique IDs, in random order
totalIDs = np.random.randint(0, 10000000, N)
totalIDs = np.unique(totalIDs)
np.random.shuffle(totalIDs)
v1 = np.random.rand(len(totalIDs))
v2 = np.random.rand(len(totalIDs))

# a few thousand unique IDs drawn from totalIDs
subsampleIDs = np.random.choice(totalIDs, M)
subsampleIDs = np.unique(subsampleIDs)
np.random.shuffle(subsampleIDs)

Then convert your data into a DataFrame:

import pandas as pd
df = pd.DataFrame(data={"v1": v1, "v2": v2}, index=totalIDs)
df.loc[subsampleIDs]

A DataFrame uses a hash table to map index labels to their positions, so this lookup is very fast.
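
For a rough sense of the difference, a hedged timing sketch using the test data created above (exact numbers depend on your machine and on the numpy/pandas versions):

import timeit

t_in1d = timeit.timeit(lambda: np.in1d(totalIDs, subsampleIDs, assume_unique=True), number=10)
t_loc = timeit.timeit(lambda: df.loc[subsampleIDs], number=10)
print("in1d: %.3fs   df.loc: %.3fs   (10 runs each)" % (t_in1d, t_loc))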


Often this kind of indexing is best performed using a DB (with proper column-indexing).
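
As a rough illustration of that (not from the original answer; the table and column names are made up), Python's built-in sqlite3 with an indexed id column could look like this, reusing the test data from the first answer:

import sqlite3
import numpy as np

# build the table (and its id index) once; use a file path instead of :memory: to persist it
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, v1 REAL)")  # PRIMARY KEY is indexed
conn.executemany("INSERT INTO data VALUES (?, ?)", zip(map(int, totalIDs), map(float, v1)))
conn.commit()

# load one subsample into a temp table and join, so the id index does the lookups
conn.execute("CREATE TEMP TABLE sub (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO sub VALUES (?)", ((int(x),) for x in subsampleIDs))
rows = conn.execute("SELECT data.v1 FROM sub JOIN data ON data.id = sub.id").fetchall()
variable = np.array([r[0] for r in rows])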

Another idea is to sort totalIDs once, as a preprocessing stage, and implement your own version of in1d , which avoids sorting everything. The numpy implementation of in1d (at least in the version that I have installed) is fairly simple, and should be easy to copy and modify.
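
A minimal sketch of that sort-once idea, using np.searchsorted against a presorted copy (the helper name in1d_presorted and the variable names are mine, not part of the answer):

import numpy as np

# one-off preprocessing: sort the IDs and remember their original positions
order = np.argsort(totalIDs)
sortedIDs = totalIDs[order]

def in1d_presorted(sortedIDs, order, ids, n_total):
    # boolean mask over the original (unsorted) totalIDs, assuming unique IDs
    pos = np.searchsorted(sortedIDs, ids)       # binary search: O(M log N)
    pos = np.clip(pos, 0, len(sortedIDs) - 1)
    hit = sortedIDs[pos] == ids                 # keep only exact matches
    mask = np.zeros(n_total, dtype=bool)
    mask[order[pos[hit]]] = True
    return mask

mask = in1d_presorted(sortedIDs, order, subsampleIDs, len(totalIDs))
variable = allvariables[mask][:, 0]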

EDIT:

Or, even better, use bucket sort (or radix sort). That should give you O(N+M), N being the size of totalIDs and M the size of subsampleIDs (times a constant you can play with by changing the number of buckets). Here too, you only split totalIDs into buckets once, which gives you a nifty O(N+M1+M2+...).

Unfortunately, I'm not aware of a numpy implementation, but I did find this: http://en.wikipedia.org/wiki/Radix_sort#Example_in_Python
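
A rough sketch of the bucketing idea (my own illustration, not from the answer or the linked page): split totalIDs into buckets by ID value once, then for each subsample ID scan only its bucket.

import numpy as np

n_buckets = 1024

# one-off preprocessing: group the original row numbers by a cheap hash of the ID
buckets = [[] for _ in range(n_buckets)]
for row, i in enumerate(totalIDs):
    buckets[i % n_buckets].append(row)

def lookup_rows(ids):
    # original row numbers of the given IDs, scanning only one bucket per ID
    rows = []
    for i in ids:
        for row in buckets[i % n_buckets]:
            if totalIDs[row] == i:
                rows.append(row)
                break
    return np.array(rows)

variable = v1[lookup_rows(subsampleIDs)]

In pure Python this loop only illustrates the complexity argument; at millions of elements you would want to vectorize or compile it.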
