Grouping a pandas DataFrame by n days, starting at the beginning of the day

2018-06-12 23:11:24

I have just discovered the power of pandas and I love it, but I can't figure out this problem:

I have a DataFrame; df.head() gives:

   lon   lat  h  filename                  time
0  19.961216  80.617627    -0.077165     60048 2002-05-15 12:59:31.717467
1  19.923916  80.614847    -0.018689     60048 2002-05-15 12:59:31.831467
2  19.849396  80.609257    -0.089205     60048 2002-05-15 12:59:32.059467
3  19.830776  80.607857     0.076485     60048 2002-05-15 12:59:32.116467
4  19.570708  80.588183     0.162943     60048 2002-05-15 12:59:32.888467

I want to group my data into nine-day intervals:

gb = df.groupby(pd.TimeGrouper(key='time', freq='9D'))

The first group:

2002-05-15 12:59:31.717467       lon   lat  h filename                  time
0    19.961216  80.617627    -0.077165     60048 2002-05-15 12:59:31.717467
1    19.923916  80.614847    -0.018689     60048 2002-05-15 12:59:31.831467
2    19.849396  80.609257    -0.089205     60048 2002-05-15 12:59:32.059467
3    19.830776  80.607857     0.076485     60048 2002-05-15 12:59:32.116467
...

The next group:

2002-05-24 12:59:31.717467        lon   lat  height  filename                  time
815   18.309498  80.457024     0.187387     60309 2002-05-24 16:35:39.553563
816   18.291458  80.458514     0.061446     60309 2002-05-24 16:35:39.610563
817   18.273408  80.460014     0.129255     60309 2002-05-24 16:35:39.667563
818   18.255358  80.461504     0.046761     60309 2002-05-24 16:35:39.724563
...

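So the data is grouped into nine-day intervals counted from the first timestamp (12:59:31.717467), not from the beginning of the day as I want.

When grouping by one day: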

gb = df.groupby(pd.TimeGrouper(key='time', freq='D'))

gives me:

2002-05-15 00:00:00       lon   lat  h  filename                  time
0    19.961216  80.617627    -0.077165     60048 2002-05-15 12:59:31.717467
1    19.923916  80.614847    -0.018689     60048 2002-05-15 12:59:31.831467
2    19.849396  80.609257    -0.089205     60048 2002-05-15 12:59:32.059467
3    19.830776  80.607857     0.076485     60048 2002-05-15 12:59:32.116467
...

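I could loop over the days until I have a nine-day interval, but I think it can be done more cleverly. I am looking for a Grouper freq option equivalent to YS (year start) but for days, or a way to set the start time (maybe through the Grouper option convention : {'start', 'end', 'e', 's'}), or something else?

I am running Python 3.5.2 and pandas version 0.19.0.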

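Modifying the first row:

Your best bet is to normalize the first row of the datetime column so that its time is reset to 00:00:00 (midnight), and then group by the 9D interval: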

df.loc[0, 'time'] = df['time'].iloc[0].normalize()
for _, grp in df.groupby(pd.TimeGrouper(key='time', freq='9D')):
    print (grp)

#          lon        lat         h  filename                       time
# 0  19.961216  80.617627 -0.077165     60048 2002-05-15 00:00:00.000000
# 1  19.923916  80.614847 -0.018689     60048 2002-05-15 12:59:31.831467
# 2  19.849396  80.609257 -0.089205     60048 2002-05-15 12:59:32.059467
# 3  19.830776  80.607857  0.076485     60048 2002-05-15 12:59:32.116467
# 4  19.570708  80.588183  0.162943     60048 2002-05-15 12:59:32.888467
# ......................................................................

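The timestamps in the other rows are left as they were, so you don't lose that information.

Keeping the first row:

If you want to keep the first row unchanged and only want the grouping to start from midnight, you could do: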

df_t_shift = df.shift()    # Shift one level down
df_t_shift.loc[0, 'time'] = df_t_shift['time'].iloc[1].normalize()
# Concat last row of df with the shifted one to account for the loss of row
df_t_shift = df_t_shift.append(df.iloc[-1], ignore_index=True)  

for _, grp in df_t_shift.groupby(pd.TimeGrouper(key='time', freq='9D')):
    print (grp)

#          lon        lat         h  filename                       time
# 0        NaN        NaN       NaN       NaN 2002-05-15 00:00:00.000000
# 1  19.961216  80.617627 -0.077165   60048.0 2002-05-15 12:59:31.717467
# 2  19.923916  80.614847 -0.018689   60048.0 2002-05-15 12:59:31.831467
# 3  19.849396  80.609257 -0.089205   60048.0 2002-05-15 12:59:32.059467
# 4  19.830776  80.607857  0.076485   60048.0 2002-05-15 12:59:32.116467
# 5  19.570708  80.588183  0.162943   60048.0 2002-05-15 12:59:32.888467
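
A sketch of an alternative that leaves the frame untouched: compute each row's nine-day bin by hand, counting from midnight of the first day, and group on that. This is not the approach shown above but plain integer arithmetic on day offsets; the sample data and the names start, bin_number and bin_start are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: one reading every 3 days, starting mid-day.
df = pd.DataFrame({
    "time": pd.date_range("2002-05-15 12:59:31", periods=8, freq="3D"),
    "h": range(8),
})

# Midnight of the first day is the anchor for all bins.
start = df["time"].min().normalize()

# Whole days elapsed since the anchor, integer-divided into 9-day bins.
bin_number = (df["time"] - start).dt.days // 9

# Label each bin with the midnight at which it starts.
bin_start = start + pd.to_timedelta(bin_number * 9, unit="D")

for label, grp in df.groupby(bin_start):
    print(label, len(grp))
```

No row is added or overwritten here, so there is no NaN placeholder row to filter out afterwards.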

To complete @mfitzp's answer, you can do this:

df['dateonly'] = df['time'].apply(lambda x: x.date())

The only problem is that df['dateonly'] will not be datetime-typed (it holds plain datetime.date objects).

You need to convert it first:

df['dateonly'] = pd.to_datetime(df['dateonly'])

Now you can group on it:

gb = df.groupby(pd.TimeGrouper(key='dateonly', freq='9D'))

As an extra note, convention is used with a PeriodIndex, not a DatetimeIndex.
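
For readers on newer pandas: pandas 1.1 and later add an origin argument to pd.Grouper (and to resample), which is close to the "set the start time" option the question asks for. A minimal sketch, assuming pandas >= 1.1 and invented sample data: origin='start' anchors the bins at the first timestamp (the behaviour described in the question), while origin='start_day' anchors them at midnight of the first day.

```python
import pandas as pd

# Hypothetical sample: one reading every 3 days, starting mid-day.
df = pd.DataFrame({
    "time": pd.date_range("2002-05-15 12:59:31", periods=8, freq="3D"),
    "h": range(8),
})

# Bins anchored at the first timestamp in the data.
from_first = df.groupby(
    pd.Grouper(key="time", freq="9D", origin="start")
).size()

# Bins anchored at midnight of the first day.
from_midnight = df.groupby(
    pd.Grouper(key="time", freq="9D", origin="start_day")
).size()

print(from_first.index[0])     # label of the first bin, anchored at the first row
print(from_midnight.index[0])  # label of the first bin, anchored at midnight
```

This does not apply to pandas 0.19.0 as used in the question, where the column has to be adjusted by hand as in the answers here.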

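If you truncate the datetimes to midnight of the given day, the grouping will work as expected (starting at the beginning of the day). I expected it to work by converting to a date, e.g.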

df['date'] = df['time'].apply(lambda x:x.date())

However, you cannot use TimeGrouper unless the grouping key is a datetime. Instead, you have two options. You can truncate the datetime directly to midnight, as follows:

df['date'] = df['time'].apply(lambda x: x.replace(hour=0, minute=0, second=0, microsecond=0))

Alternatively, you can generate the date values first and then convert them back to datetimes with the pd.to_datetime() function:

df['date'] = df['time'].apply(lambda x: x.date() )
df['date'] = pd.to_datetime(df['date'])
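
Both options can also be written without apply, since pandas exposes Series.dt.normalize() for datetime columns. A short sketch, assuming a pandas version that provides pd.Grouper (which supersedes pd.TimeGrouper in newer releases); the sample data is invented.

```python
import pandas as pd

# Hypothetical sample: one reading every 3 days, starting mid-day.
df = pd.DataFrame({
    "time": pd.date_range("2002-05-15 12:59:31", periods=8, freq="3D"),
    "h": range(8),
})

# Truncate every timestamp to midnight; the result keeps a datetime64 dtype,
# so it can be used directly as the grouping key.
df["date"] = df["time"].dt.normalize()

gb = df.groupby(pd.Grouper(key="date", freq="9D"))
for start, grp in gb:
    print(start, len(grp))
```

The original time column is untouched, so the full timestamps remain available inside each group.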
