np.isnan on arrays of dtype "object"
I'm working with numpy arrays of different data types. I would like to know, of any particular array, which elements are NaN. Normally, this is what np.isnan is for. However, np.isnan isn't friendly to arrays of data type object (or any string data type):
>>> str_arr = np.array(["A", "B", "C"])
>>> np.isnan(str_arr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Not implemented for this type
>>> obj_arr = np.array([1, 2, "A"], dtype=object)
>>> np.isnan(obj_arr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
What I would like to get out of these two calls is simply np.array([False, False, False]). I can't just put try and except TypeError around my call to np.isnan and assume that any array that generates a TypeError does not contain NaNs: after all, I'd like np.isnan(np.array([1, np.NaN, "A"])) to return np.array([False, True, False]).
My current solution is to make a new array of dtype np.float64, loop through the elements of the original array, trying to put each element into the new array (and leaving it as zero if that fails), and then call np.isnan on the new array. However, this is of course rather slow, at least for large object arrays.
def isnan(arr):
    if isinstance(arr, np.ndarray) and (arr.dtype == object):
        # Create a new array of dtype float64, fill it with the same values as the input array (where
        # possible), and then call np.isnan on the new array. This way, np.isnan is only called once.
        # (Much faster than calling it on every element in the input array.)
        new_arr = np.zeros((len(arr),), dtype=np.float64)
        for idx in xrange(len(arr)):
            try:
                new_arr[idx] = arr[idx]
            except Exception:
                pass
        return np.isnan(new_arr)
    else:
        try:
            return np.isnan(arr)
        except TypeError:
            return False
This particular implementation also only works for one-dimensional arrays, and I can't think of a decent way to make the for loop run over an arbitrary number of dimensions. Is there a more efficient way to figure out which elements in an object-type array are NaN?
EDIT: I'm running Python 2.7.10.
Note that [x is np.nan for x in np.array([np.nan])] returns [False]: np.nan is not always the same object in memory as a different np.nan.
I do not want the string "nan" to be considered equivalent to np.nan: I want isnan(np.array(["nan"], dtype=object)) to return np.array([False]).
The multi-dimensionality isn't a big issue. (It's nothing that a little ravel-and-reshape-ing won't fix, as sketched after these notes. :p)
Any function that relies on the is operator to test equivalence of two NaNs isn't always going to work. (If you think it should, ask yourself what the is operator actually does!)
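For reference, a minimal sketch of that ravel-and-reshape wrapping, assuming some 1-D implementation such as the isnan above is passed in as isnan_1d (both names here are just placeholders):
import numpy as np

def isnan_nd(arr, isnan_1d):
    # Flatten to 1-D, apply the 1-D NaN test, then restore the original shape.
    flat_mask = np.asarray(isnan_1d(arr.ravel()))
    return flat_mask.reshape(arr.shape)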
You could just use a list comp to get the indices of any nans, which may be faster in this case:
obj_arr = np.array([1, 2, np.nan, "A"], dtype=object)
inds = [i for i,n in enumerate(obj_arr) if str(n) == "nan"]
Or if you want a boolean mask:
mask = [True if str(n) == "nan" else False for n in obj_arr]
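If you need a NumPy boolean array rather than a list (for boolean indexing, say), wrapping the comprehension is enough. A small usage sketch reusing obj_arr from above, not part of the original answer:
mask = np.array([str(n) == "nan" for n in obj_arr], dtype=bool)
obj_arr[mask]      # array([nan], dtype=object)
mask.nonzero()[0]  # array([2]) -- positions of the NaN entries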
Using is np.nan also seems to work without needing to cast to str:
In [29]: obj_arr = np.array([1, 2, np.nan, "A"], dtype=object)
In [30]: [x is np.nan for x in obj_arr]
Out[30]: [False, False, True, False]
For flat and multidimensional arrays you could check the shape:
def masks(a):
    if len(a.shape) > 1:
        return [[x is np.nan for x in sub] for sub in a]
    return [x is np.nan for x in a]
If is np.nan can fail, maybe check the type and then use np.isnan:
def masks(a):
    if len(a.shape) > 1:
        return [[isinstance(x, float) and np.isnan(x) for x in sub] for sub in a]
    return [isinstance(x, float) and np.isnan(x) for x in a]
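A quick usage sketch of that type-checked version on a 2-D object array (the example values are made up for illustration):
a = np.array([[1, np.nan], ["A", 2.5]], dtype=object)
masks(a)  # [[False, True], [False, False]]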
Interestingly, x is np.nan seems to work fine when the data type is object:
In [76]: arr = np.array([np.nan,np.nan,"3"],dtype=object)
In [77]: [x is np.nan for x in arr]
Out[77]: [True, True, False]
In [78]: arr = np.array([np.nan,np.nan,"3"])
In [79]: [x is np.nan for x in arr]
Out[79]: [False, False, False]
Depending on the dtype, different things happen:
In [90]: arr = np.array([np.nan,np.nan,"3"])
In [91]: arr.dtype
Out[91]: dtype('S32')
In [92]: arr
Out[92]:
array(['nan', 'nan', '3'],
      dtype='|S32')
In [93]: [x == "nan" for x in arr]
Out[93]: [True, True, False]
In [94]: arr = np.array([np.nan,np.nan,"3"],dtype=object)
In [95]: arr.dtype
Out[95]: dtype('O')
In [96]: arr
Out[96]: array([nan, nan, '3'], dtype=object)
In [97]: [x == "nan" for x in arr]
Out[97]: [False, False, False]
Obviously the nans get coerced to numpy.string_ when you have strings in your array, so x == "nan" works in that case. When you pass dtype=object the type stays float, so if you are always using the object dtype, the behaviour should be consistent.
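If you do need to handle both layouts, one possibility (an assumption about how you might dispatch, not something from the answer itself) is to branch on the dtype kind before picking a test:
def nan_mask(arr):
    if arr.dtype.kind == 'O':
        # object dtype: nan is still a Python float, so test each element directly
        return [isinstance(x, float) and np.isnan(x) for x in arr]
    if arr.dtype.kind in 'SU':
        # string dtype: nan was coerced to the string 'nan' at creation time,
        # so this cannot tell it apart from a genuine "nan" string
        return [x == 'nan' for x in arr]
    return list(np.isnan(arr))  # ordinary numeric dtype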
Define a couple of test arrays, small and bigger
In [21]: x=np.array([1,23.3, np.nan, 'str'],dtype=object)
In [22]: xb=np.tile(x,300)
Your function:
In [23]: isnan(x)
Out[23]: array([False, False, True, False], dtype=bool)
The straightforward list comprehension, returning an array:
In [24]: np.array([i is np.nan for i in x])
Out[24]: array([False, False, True, False], dtype=bool)
np.frompyfunc has similar vectorizing power to np.vectorize, but for some reason it is underutilized (and in my experience faster):
In [25]: def myisnan(x):
   ....:     return x is np.nan
In [26]: visnan=np.frompyfunc(myisnan,1,1)
In [27]: visnan(x)
Out[27]: array([False, False, True, False], dtype=object)
Since it returns dtype object, we may want to cast its values:
In [28]: visnan(x).astype(bool)
Out[28]: array([False, False, True, False], dtype=bool)
It can handle multidim arrays nicely:
In [29]: visnan(x.reshape(2,2)).astype(bool)
Out[29]:
array([[False, False],
       [ True, False]], dtype=bool)
Now for some timings:
In [30]: timeit isnan(xb)
1000 loops, best of 3: 1.03 ms per loop
In [31]: timeit np.array([i is np.nan for i in xb])
1000 loops, best of 3: 393 us per loop
In [32]: timeit visnan(xb).astype(bool)
1000 loops, best of 3: 382 us per loop
An important point with the i is np.nan test: it only applies to scalars. If the array is dtype object, then iteration produces scalars. But for an array of dtype float, we get values of type numpy.float64, and for those np.isnan(i) is the correct test.
In [61]: [(i is np.nan) for i in np.array([np.nan,np.nan,1.3])]
Out[61]: [False, False, False]
In [62]: [np.isnan(i) for i in np.array([np.nan,np.nan,1.3])]
Out[62]: [True, True, False]
In [63]: [(i is np.nan) for i in np.array([np.nan,np.nan,1.3], dtype=object)]
Out[63]: [True, True, False]
In [64]: [np.isnan(i) for i in np.array([np.nan,np.nan,1.3], dtype=object)]
...
TypeError: Not implemented for this type
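Building on that point, one way to get a single test that works for both dtypes is to put an isinstance guard inside the function handed to frompyfunc. A sketch under that assumption, not part of the timed code above:
def _isnan_scalar(x):
    # True only for float values (plain Python floats or np.float64) that are NaN;
    # strings and other objects short-circuit to False
    return isinstance(x, float) and np.isnan(x)

visnan_any = np.frompyfunc(_isnan_scalar, 1, 1)

visnan_any(np.array([np.nan, np.nan, 1.3])).astype(bool)                # array([ True,  True, False])
visnan_any(np.array([np.nan, np.nan, 1.3], dtype=object)).astype(bool)  # array([ True,  True, False])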
I would use np.vectorize and a custom function that tests for nan elementwise. So,
def _isnan(x):
    if isinstance(x, type(np.nan)):
        return np.isnan(x)
    else:
        return False
my_isnan = np.vectorize(_isnan)
Then
X = np.array([[1, 2, np.nan, "A"], [np.nan, True, [], ""]], dtype=object)
my_isnan(X)
returns
array([[False, False,  True, False],
       [ True, False, False, False]], dtype=bool)
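One note on the isinstance(x, type(np.nan)) check: np.nan is a plain Python float, so this is simply an isinstance(x, float) test, and ordinary non-NaN floats also take the np.isnan branch and correctly come back False. A small illustration, not from the original answer:
type(np.nan) is float                            # True
my_isnan(np.array([1.5, np.nan], dtype=object))  # array([False,  True], dtype=bool)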