np.isnan on arrays of dtype "object"
I'm working with numpy arrays of different data types. I would like to know, of any particular array, which elements are NaN. Normally, this is what np.isnan is for. However, np.isnan isn't friendly to arrays of data type object (or any string data type):
>>> str_arr = np.array(["A", "B", "C"])
>>> np.isnan(str_arr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Not implemented for this type
>>> obj_arr = np.array([1, 2, "A"], dtype=object)
>>> np.isnan(obj_arr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
What I would like to get out of these two calls is simply np.array([False, False, False]). I can't just put try and except TypeError around my call to np.isnan and assume that any array that generates a TypeError does not contain NaNs: after all, I'd like np.isnan(np.array([1, np.NaN, "A"])) to return np.array([False, True, False]).
My current solution is to make a new array of dtype np.float64, loop through the elements of the original array, trying to put each element into the new array (and leaving it as zero if that fails), and then call np.isnan on the new array. However, this is of course rather slow, at least for large object arrays.
def isnan(arr):
    if isinstance(arr, np.ndarray) and (arr.dtype == object):
        # Create a new array of dtype float64, fill it with the same values as the input array (where
        # possible), and then call np.isnan on the new array. This way, np.isnan is only called once.
        # (Much faster than calling it on every element in the input array.)
        new_arr = np.zeros((len(arr),), dtype=np.float64)
        for idx in xrange(len(arr)):
            try:
                new_arr[idx] = arr[idx]
            except Exception:
                pass
        return np.isnan(new_arr)
    else:
        try:
            return np.isnan(arr)
        except TypeError:
            return False
This particular implementation also only works for one-dimensional arrays, and I can't think of a decent way to make the for loop run over an arbitrary number of dimensions. Is there a more efficient way to figure out which elements in an object-type array are NaN?
EDIT: I'm running Python 2.7.10.
Note that [x is np.nan for x in np.array([np.nan])] returns [False]: np.nan is not always the same object in memory as a different np.nan.
I do not want the string "nan" to be considered equivalent to np.nan: I want isnan(np.array(["nan"], dtype=object)) to return np.array([False]).
The multi-dimensionality isn't a big issue. (It's nothing that a little ravel-and-reshape-ing won't fix, as sketched after these notes. :p)
Any function that relies on the is operator to test equivalence of two NaNs isn't always going to work. (If you think it should, ask yourself what the is operator actually does!)
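For reference, a minimal sketch of that ravel-and-reshape wrapping, assuming some 1-D implementation such as the isnan above is passed in as isnan_1d (both names here are just placeholders):
import numpy as np

def isnan_nd(arr, isnan_1d):
    # Flatten to 1-D, apply the 1-D NaN test, then restore the original shape.
    flat_mask = np.asarray(isnan_1d(arr.ravel()))
    return flat_mask.reshape(arr.shape)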
You could just use a list comp to get the indices of any nans, which may be faster in this case:
obj_arr = np.array([1, 2, np.nan, "A"], dtype=object)
inds = [i for i,n in enumerate(obj_arr) if str(n) == "nan"]
Or if you want a boolean mask:
mask = [True if str(n) == "nan" else False for n in obj_arr]
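If you need a NumPy boolean array rather than a list (for boolean indexing, say), wrapping the comprehension is enough. A small usage sketch reusing obj_arr from above, not part of the original answer:
mask = np.array([str(n) == "nan" for n in obj_arr], dtype=bool)
obj_arr[mask]      # array([nan], dtype=object)
mask.nonzero()[0]  # array([2]) -- positions of the NaN entries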
Using is np.nan also seems to work without needing to cast to str:
In [29]: obj_arr = np.array([1, 2, np.nan, "A"], dtype=object)
In [30]: [x is np.nan for x in obj_arr]
Out[30]: [False, False, True, False]
For flat and multidimensional arrays you could check the shape:
def masks(a):
    if len(a.shape) > 1:
        return [[x is np.nan for x in sub] for sub in a]
    return [x is np.nan for x in a]
If is np.nan can fail, maybe check the type and then use np.isnan:
def masks(a):
    if len(a.shape) > 1:
        return [[isinstance(x, float) and np.isnan(x) for x in sub] for sub in a]
    return [isinstance(x, float) and np.isnan(x) for x in a]
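A quick usage sketch of that type-checked version on a 2-D object array (the example values are made up for illustration):
a = np.array([[1, np.nan], ["A", 2.5]], dtype=object)
masks(a)  # [[False, True], [False, False]]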
Interestingly, x is np.nan seems to work fine when the data type is object:
In [76]: arr = np.array([np.nan,np.nan,"3"],dtype=object)
In [77]: [x is np.nan for x in arr]
Out[77]: [True, True, False]
In [78]: arr = np.array([np.nan,np.nan,"3"])
In [79]: [x is np.nan for x in arr]
Out[79]: [False, False, False]
Depending on the dtype, different things happen:
In [90]: arr = np.array([np.nan,np.nan,"3"])
In [91]: arr.dtype
Out[91]: dtype('S32')
In [92]: arr
Out[92]:
array(['nan', 'nan', '3'],
      dtype='|S32')
In [93]: [x == "nan" for x in arr]
Out[93]: [True, True, False]
In [94]: arr = np.array([np.nan,np.nan,"3"],dtype=object)
In [95]: arr.dtype
Out[95]: dtype('O')
In [96]: arr
Out[96]: array([nan, nan, '3'], dtype=object)
In [97]: [x == "nan" for x in arr]
Out[97]: [False, False, False]
Obviously the nans get coerced to numpy.string_ when you have strings in your array, so x == "nan" works in that case. When you pass dtype=object the type stays float, so if you are always using the object dtype, the behaviour should be consistent.
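If you do need to handle both layouts, one possibility (an assumption about how you might dispatch, not something from the answer itself) is to branch on the dtype kind before picking a test:
def nan_mask(arr):
    if arr.dtype.kind == 'O':
        # object dtype: nan is still a Python float, so test each element directly
        return [isinstance(x, float) and np.isnan(x) for x in arr]
    if arr.dtype.kind in 'SU':
        # string dtype: nan was coerced to the string 'nan' at creation time,
        # so this cannot tell it apart from a genuine "nan" string
        return [x == 'nan' for x in arr]
    return list(np.isnan(arr))  # ordinary numeric dtype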
Define a couple of test arrays, small and bigger
In [21]: x=np.array([1,23.3, np.nan, 'str'],dtype=object)
In [22]: xb=np.tile(x,300)
Your function:
In [23]: isnan(x)
Out[23]: array([False, False, True, False], dtype=bool)
The straightforward list comprehension, returning an array:
In [24]: np.array([i is np.nan for i in x])
Out[24]: array([False, False, True, False], dtype=bool)
np.frompyfunc has similar vectorizing power to np.vectorize, but for some reason it is underutilized (and in my experience faster):
In [25]: def myisnan(x):
   ....:     return x is np.nan
In [26]: visnan=np.frompyfunc(myisnan,1,1)
In [27]: visnan(x)
Out[27]: array([False, False, True, False], dtype=object)
Since it returns dtype object, we may want to cast its values:
In [28]: visnan(x).astype(bool)
Out[28]: array([False, False, True, False], dtype=bool)
It can handle multidim arrays nicely:
In [29]: visnan(x.reshape(2,2)).astype(bool)
Out[29]:
array([[False, False],
       [ True, False]], dtype=bool)
Now for some timings:
In [30]: timeit isnan(xb)
1000 loops, best of 3: 1.03 ms per loop
In [31]: timeit np.array([i is np.nan for i in xb])
1000 loops, best of 3: 393 us per loop
In [32]: timeit visnan(xb).astype(bool)
1000 loops, best of 3: 382 us per loop
An important point with the i is np.nan test: it only applies to scalars. If the array is dtype object, then iteration produces scalars. But for an array of dtype float, we get values of type numpy.float64, and for those np.isnan(i) is the correct test.
In [61]: [(i is np.nan) for i in np.array([np.nan,np.nan,1.3])]
Out[61]: [False, False, False]
In [62]: [np.isnan(i) for i in np.array([np.nan,np.nan,1.3])]
Out[62]: [True, True, False]
In [63]: [(i is np.nan) for i in np.array([np.nan,np.nan,1.3], dtype=object)]
Out[63]: [True, True, False]
In [64]: [np.isnan(i) for i in np.array([np.nan,np.nan,1.3], dtype=object)]
...
TypeError: Not implemented for this type
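Building on that point, one way to get a single test that works for both dtypes is to put an isinstance guard inside the function handed to frompyfunc. A sketch under that assumption, not part of the timed code above:
def _isnan_scalar(x):
    # True only for float values (plain Python floats or np.float64) that are NaN;
    # strings and other objects short-circuit to False
    return isinstance(x, float) and np.isnan(x)

visnan_any = np.frompyfunc(_isnan_scalar, 1, 1)

visnan_any(np.array([np.nan, np.nan, 1.3])).astype(bool)                # array([ True,  True, False])
visnan_any(np.array([np.nan, np.nan, 1.3], dtype=object)).astype(bool)  # array([ True,  True, False])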
I would use np.vectorize and a custom function that tests for nan elementwise. So,
def _isnan(x):
    if isinstance(x, type(np.nan)):
        return np.isnan(x)
    else:
        return False
my_isnan = np.vectorize(_isnan)
Then
X = np.array([[1, 2, np.nan, "A"], [np.nan, True, [], ""]], dtype=object)
my_isnan(X)
returns
array([[False, False,  True, False],
       [ True, False, False, False]], dtype=bool)
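One note on the isinstance(x, type(np.nan)) check: np.nan is a plain Python float, so this is simply an isinstance(x, float) test, and ordinary non-NaN floats also take the np.isnan branch and correctly come back False. A small illustration, not from the original answer:
type(np.nan) is float                            # True
my_isnan(np.array([1.5, np.nan], dtype=object))  # array([False,  True], dtype=bool)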