Vectorized Lookups of Pandas Series to a Dictionary
Problem statement:
In a pandas DataFrame, a new boolean column same_group needs to be created from the values of two existing columns, row and col. A row should show True if its two cell values map to intersecting lists in the dictionary memberships, and False otherwise (disjoint lists). How can I do this in a vectorized way (without using apply)?
Setup:
import pandas as pd
import numpy as np
n = np.nan
memberships = {
    'a': ['vowel'],
    'b': ['consonant'],
    'c': ['consonant'],
    'd': ['consonant'],
    'e': ['vowel'],
    'y': ['consonant', 'vowel']
}
congruent = pd.DataFrame.from_dict(
    {'row': ['a', 'b', 'c', 'd', 'e', 'y'],
     'a': [  n, -.8, -.6, -.3,  .8, .01],
     'b': [-.8,   n,  .5,  .7, -.9, .01],
     'c': [-.6,  .5,   n,  .3,  .1, .01],
     'd': [-.3,  .7,  .3,   n,  .2, .01],
     'e': [ .8, -.9,  .1,  .2,   n, .01],
     'y': [.01, .01, .01, .01, .01,   n],
    }).set_index('row')
congruent.columns.names = ['col']
cs = congruent.stack().to_frame()
cs.columns = ['score']
cs.reset_index(inplace=True)
cs.head(6)
Desired goal:
How do I perform the lookup against the dictionary to create this new column?
Note that I am trying to find intersection, not equivalence. For example, row 4 should have a same_group of 1, because a and y are both vowels (even though y is only "sometimes a vowel" and therefore belongs to both the vowel and consonant groups).
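For reference, here is a minimal non-vectorized baseline using apply (the very pattern to avoid), shown only to pin down the intended semantics; it is a sketch that assumes the cs frame built above and stores the result under a throwaway name, baseline:

# reference implementation: 1 if the two membership lists intersect, else 0
baseline = cs.apply(
    lambda r: int(bool(set(memberships[r['row']]) & set(memberships[r['col']]))),
    axis=1)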
# create a series to make it convenient to map
# make each member a set so I can intersect later
lkp = pd.Series(memberships).apply(set)
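# e.g. lkp['a'] == {'vowel'} and lkp['y'] == {'consonant', 'vowel'}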
# get number of rows and columns
# map the sets to column and row indices
n, m = congruent.shape
c = congruent.columns.to_series().map(lkp).values
r = congruent.index.to_series().map(lkp).values
print(c)
[{'vowel'} {'consonant'} {'consonant'} {'consonant'} {'vowel'}
{'consonant', 'vowel'}]
print(r)
[{'vowel'} {'consonant'} {'consonant'} {'consonant'} {'vowel'}
{'consonant', 'vowel'}]
# use np.repeat, np.tile, zip to create cartesian product
# this should match index after stacking
# apply set intersection for each pair
# empty sets are False, otherwise True
same = [
bool(set.intersection(*tup))
for tup in zip(np.repeat(r, m), np.tile(c, n))
]
# use dropna=False to ensure we maintain the
# cartesian product I was expecting
# then slice with boolean list I created
# and dropna
congruent.stack(dropna=False)[same].dropna()
row col
a e 0.80
y 0.01
b c 0.50
d 0.70
y 0.01
c b 0.50
d 0.30
y 0.01
d b 0.70
c 0.30
y 0.01
e a 0.80
y 0.01
y a 0.01
b 0.01
c 0.01
d 0.01
e 0.01
dtype: float64
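A quick sanity check (my addition, not part of the original answer) that the repeat/tile order really does match the order of the stacked MultiIndex; the name pairs is introduced just for this check:

# the (row, col) pairs from repeat/tile should equal the stacked index order
pairs = list(zip(np.repeat(congruent.index, m), np.tile(congruent.columns, n)))
assert pairs == list(congruent.stack(dropna=False).index)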
To produce the desired output:
congruent.stack(dropna=False).reset_index(name='Score') \
    .assign(same_group=np.array(same).astype(int)).dropna()
Idea: let's convert the ['vowel', 'consonant'] lists into binary flags [1, 2] and use bitwise operations:
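To see the idea in isolation (a tiny illustration added here, with made-up names): each group gets its own bit, a letter's code is the OR of its group bits, and two letters share a group exactly when the AND of their codes is non-zero.

VOWEL, CONSONANT = 1, 2                       # one bit per group
a_code, b_code, y_code = VOWEL, CONSONANT, VOWEL | CONSONANT
print(a_code & b_code)                        # 0 -> 'a' and 'b' share no group
print(a_code & y_code)                        # 1 -> 'a' and 'y' share the vowel group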
Setup:
In [138]: lkp2 = (pd.Series(memberships)
     ...:            .apply(pd.Series)
     ...:            .replace({'vowel': 1, 'consonant': 2})
     ...:            .sum(1)
     ...:            .astype('uint8'))
In [139]: lkp2
Out[139]:
a    1    # 'vowel'
b    2    # 'consonant'
c    2    # 'consonant'
d    2    # 'consonant'
e    1    # 'vowel'
y    3    # 1 | 2 = 3 - both bits are set
dtype: uint8
Solution:
In [140]: cs['same_group'] = np.bitwise_and(cs.row.map(lkp2), cs.col.map(lkp2)).ne(0).mul(1)
In [141]: cs
Out[141]:
row col score same_group
0 a b -0.80 0
1 a c -0.60 0
2 a d -0.30 0
3 a e 0.80 1
4 a y 0.01 1
5 b a -0.80 0
6 b c 0.50 1
7 b d 0.70 1
8 b e -0.90 0
9 b y 0.01 1
10 c a -0.60 0
11 c b 0.50 1
12 c d 0.30 1
13 c e 0.10 0
14 c y 0.01 1
15 d a -0.30 0
16 d b 0.70 1
17 d c 0.30 1
18 d e 0.20 0
19 d y 0.01 1
20 e a 0.80 1
21 e b -0.90 0
22 e c 0.10 0
23 e d 0.20 0
24 e y 0.01 1
25 y a 0.01 1
26 y b 0.01 1
27 y c 0.01 1
28 y d 0.01 1
29 y e 0.01 1
30 y y 0.00 1
Timing: against a 3.1M-row DataFrame:
In [180]: cs = pd.concat([cs] * 10**5, ignore_index=True)
In [181]: cs.shape
Out[181]: (3100000, 3)
In [182]: %timeit np.bitwise_and(cs.row.map(lkp2), cs.col.map(lkp2)).ne(0).mul(1)
1 loop, best of 3: 466 ms per loop
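Essentially the same test can also be written against the underlying NumPy arrays; a small sketch under the same lkp2 mapping (my addition, not benchmarked in the original answer):

# same bitwise test on the raw uint8 arrays; same_np is a plain boolean ndarray
same_np = (cs.row.map(lkp2).values & cs.col.map(lkp2).values) != 0
cs['same_group'] = same_np.astype(int)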
Here is an attempt:
I slice the data with iloc. memberships[my_slice] gives me the value from the dictionary memberships, and wrapping the comparison in int() returns 1 if it is true and 0 if it is false, so the number can be appended directly for the new column same_group.
grp = []
for i in range(len(cs)):
    if 'y' not in cs.iloc[i,[0,1]][0|1]:
        grp.append(int(memberships[cs.iloc[i,[0,1]][0]] == memberships[cs.iloc[i,[0,1]][1]]))
        i += 1
    else:
        grp.append(int(memberships[cs.iloc[i,[0,1]][0|1]] == memberships[cs.iloc[i,[0,1]][0|1]]))
        i += 1

same_grp = pd.Series(grp)
cs = pd.concat([cs, same_grp], axis=1)
cs.columns = ['row', 'col', 'score', 'same_grp']
cs.head(10)
I used the condition (if 'y' not in cs.iloc[i,[0,1]][0|1]) to handle the fact that y can belong to two groups.
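One caveat worth noting (my note, not part of the original answer): list equality is not the same as intersection, so a pair such as row 'y' with column 'a' falls into the first branch and gets 0 even though the groups overlap:

print(memberships['y'] == memberships['a'])                 # False: plain list equality
print(bool(set(memberships['y']) & set(memberships['a'])))  # True: the groups do intersect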