Pandas, groupby and finding maximum in groups, returning value and count
I have a pandas DataFrame with log data:
        host service
0   this.com    mail
1   this.com    mail
2   this.com     web
3   that.com    mail
4  other.net    mail
5  other.net     web
6  other.net     web
For each host, I want to find the service that causes the most errors, together with its count:
        host service  no
0   this.com    mail   2
1   that.com    mail   1
2  other.net     web   2
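The only solution I have found is to group by host and service and then iterate over level 0 of the index.
Can anyone suggest a better, shorter version? Ideally one without the iteration?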
df = df_logfile.groupby(['host','service']).agg({'service':np.size})
df_count = pd.DataFrame()
df_count['host'] = df_logfile['host'].unique()
df_count['service'] = np.nan
df_count['no'] = np.nan
for h,data in df.groupby(level=0):
    i = data.idxmax()[0]
    service = i[1]
    no = data.xs(i)[0]
    df_count.loc[df_count['host'] == h, 'service'] = service
    df_count.loc[(df_count['host'] == h) & (df_count['service'] == service), 'no'] = no
Full code: https://gist.github.com/bjelline/d8066de66e305887b714
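Given df, the next step is to group by the host value alone and aggregate with idxmax. This gives you the index labels that correspond to the largest service count for each host. You can then use df.loc[...] to select the rows of df that correspond to those largest counts: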
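import pandas as pd

df_logfile = pd.DataFrame({
    'host': ['this.com', 'this.com', 'this.com', 'that.com', 'other.net',
             'other.net', 'other.net'],
    'service': ['mail', 'mail', 'web', 'mail', 'mail', 'web', 'web']})

# Count rows per (host, service) pair. Named aggregation (no='count') is used
# because newer pandas versions reject the renaming form agg({'no': 'count'}).
df = df_logfile.groupby(['host', 'service'])['service'].agg(no='count')
# For each host (index level 0), get the (host, service) label of the largest count.
mask = df.groupby(level=0).agg('idxmax')
# Select those rows and turn the MultiIndex back into ordinary columns.
df_count = df.loc[mask['no']]
df_count = df_count.reset_index()
print("\nOutput\n{}".format(df_count))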
This yields the DataFrame:
        host service  no
0  other.net     web   2
1   that.com    mail   1
2   this.com    mail   2
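As a possible variation on the same idea, the counts can be kept in a flat frame so that idxmax returns plain row labels instead of (host, service) tuples. This is only a sketch, not part of the original answer: the intermediate name counts is made up for illustration, and it assumes that when two services tie for a host it is acceptable to keep only the first one.

```python
import pandas as pd

df_logfile = pd.DataFrame({
    'host': ['this.com', 'this.com', 'this.com', 'that.com', 'other.net',
             'other.net', 'other.net'],
    'service': ['mail', 'mail', 'web', 'mail', 'mail', 'web', 'web']})

# Count rows per (host, service) pair and flatten the result into
# ordinary columns; 'counts' is a hypothetical helper name.
counts = df_logfile.groupby(['host', 'service']).size().reset_index(name='no')

# For each host, idxmax gives the row label of the largest count
# (the first such row if there is a tie); loc then picks those rows.
best_rows = counts.groupby('host')['no'].idxmax()
df_count = counts.loc[best_rows].reset_index(drop=True)

print(df_count)
```

Hosts come out in sorted order here, as in the output above. Passing sort=False to both groupby calls should instead keep them in order of first appearance, which is the order shown in the question.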