masking a series with a boolean array
This has given me a lot of trouble, and I am perplexed by the incompatibility of numpy arrays with pandas series. When I create a boolean array using a series, for instance
import numpy as np
import pandas as pd

x = np.array([1, 2, 3, 4, 5, 6, 7])
y = pd.Series([1, 2, 3, 4, 5, 6, 7])
delta = np.percentile(x, 50)
deltamask = x - y > delta
deltamask is a boolean pandas Series.
However, if you do
x[deltamask]
y[deltamask]
you find that the array completely ignores the mask. No error is raised, but you end up with two objects of different lengths. This means that an operation like
x[deltamask]*y[deltamask]
results in an error:
print(type(x - y))
print(type(x[deltamask]), len(x[deltamask]))
print(type(y[deltamask]), len(y[deltamask]))
Even more perplexing, I noticed that the operator < is treated differently. For instance
print(type(2*x < x*y))
print(type(2 < x*y))
will give you a pd.Series and an np.ndarray respectively.
Also,
5 < x - y
results in a series, so it seems that the series takes precedence, whereas the boolean elements of a series mask are promoted to integers when passed to a numpy array and result in a sliced array.
What is the reason for this?
Fancy Indexing
As NumPy currently stands, fancy indexing works as follows:
If the thing between brackets is a tuple (whether with explicit parens or not), the elements of the tuple are indices for different dimensions of x. For example, both x[(True, True)] and x[True, True] will raise IndexError: too many indices for array in this case because x is 1D. However, before the exception happens, a telling warning will be raised too: VisibleDeprecationWarning: using a boolean instead of an integer will result in an error in the future.
If the thing between brackets is exactly an ndarray, not a subclass or other array-like, and has a boolean type, it will be applied as a mask. This is why x[deltamask.values] gives the expected result (an empty array, since deltamask is all False).
If the thing between brackets is any other array-like, whether a subclass like Series, just a list, or something else, it is converted to an np.intp array (if possible) and used as an integer index. So x[deltamask] yields something equivalent to x[[False] * 7] or just x[[0] * 7]. In this case, len(deltamask)==7 and x[0]==1, so the result is [1, 1, 1, 1, 1, 1, 1].
This behavior is counterintuitive, and the warning it generates (FutureWarning: in the future, boolean array-likes will be handled as a boolean array index) indicates that a fix is in the works. I will update this answer as I find out about or make any changes to numpy.
This information can be found in Sebastian Berg's response to my initial query on the NumPy discussion list.
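As a minimal sketch of the explicit-mask route described above, the following converts a boolean Series into a plain boolean ndarray before indexing, so that the array and the Series are filtered the same way. The names mask and plain_mask are illustrative, and the condition y > delta is an assumption chosen so that some elements are actually selected; neither comes from the original question.

```python
import numpy as np
import pandas as pd

x = np.array([1, 2, 3, 4, 5, 6, 7])
y = pd.Series([1, 2, 3, 4, 5, 6, 7])
delta = np.percentile(x, 50)  # median of x, i.e. 4.0

# Illustrative condition (an assumption, not the question's x - y > delta):
# it selects the elements of y that exceed the median.
mask = y > delta

# Convert the boolean Series to an exact boolean ndarray so that NumPy
# treats it as a mask rather than as an array-like of integer indices.
plain_mask = np.asarray(mask, dtype=bool)

x_sel = x[plain_mask]  # ndarray with the selected elements
y_sel = y[plain_mask]  # Series with the selected elements

# Both selections now have the same length, so they can be combined.
product = x_sel * y_sel
print(x_sel)             # [5 6 7]
print(product.tolist())  # [25, 36, 49]
```

Using deltamask.values, as in the answer, achieves the same thing; np.asarray is used here only because it also accepts lists and other array-likes.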
Relational Operators
Now let's address the second part of your question about how the comparison works. Relational operators (<, >, <=, >=) work by calling the corresponding method on one of the objects being compared. For < this is __lt__. However, instead of just calling x.__lt__(y) for the expression x < y, Python first checks the types of the objects being compared. If the type of y is a subclass of the type of x and implements the comparison, then Python prefers to call y.__gt__(x) instead, regardless of how you wrote the original comparison. In that case, the only way that x.__lt__(y) will get called is if y.__gt__(x) returns NotImplemented to indicate that the comparison is not supported in that direction.
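The subclass priority rule can be observed with a small pure-Python sketch. Base and Derived are hypothetical classes made up for illustration; they only record which comparison method Python ends up calling.

```python
calls = []  # records which comparison methods get invoked


class Base:
    def __lt__(self, other):
        calls.append("Base.__lt__")
        return True

    def __gt__(self, other):
        calls.append("Base.__gt__")
        return True


class Derived(Base):
    # Derived overrides the reflected method so that the call is
    # recorded under its own name.
    def __gt__(self, other):
        calls.append("Derived.__gt__")
        return True


# Written as "Base < Derived", but the right operand's type is a subclass
# of the left operand's type, so the reflected method is tried first.
outcome = Base() < Derived()
print(calls)  # ['Derived.__gt__']
```

Because Derived.__gt__ returns a real result rather than NotImplemented, Base.__lt__ is never called.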
A similar thing happens when you do 5 < x - y. While ndarray is not a subclass of int, the comparison int.__lt__(ndarray) returns NotImplemented, so Python actually ends up calling (x - y).__gt__(5), which is of course defined and works just fine.
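The NotImplemented fallback can be sketched the same way without numpy. Wrapper is a hypothetical stand-in for an array-like object that int knows nothing about; it is not part of numpy or pandas.

```python
calls = []  # records which comparison methods get invoked


class Wrapper:
    """Hypothetical stand-in for an object that int cannot compare against."""

    def __init__(self, value):
        self.value = value

    def __gt__(self, other):
        calls.append("Wrapper.__gt__")
        return self.value > other


w = Wrapper(10)

# int does not know how to compare itself with a Wrapper.
direct = (5).__lt__(w)
print(direct is NotImplemented)  # True

# Python therefore falls back to the reflected method, w.__gt__(5).
result = 5 < w
print(result, calls)  # True ['Wrapper.__gt__']
```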
A much more succinct explanation of all this can be found in the Python docs.