Grouping a Pandas DataFrame by n days starting at the beginning of the day

2018-06-12 23:11:25

I have just discovered the power of Pandas and I love it, but I can't figure out this problem:

I have a DataFrame; df.head() gives:

         lon        lat            h  filename                       time
0  19.961216  80.617627    -0.077165     60048 2002-05-15 12:59:31.717467
1  19.923916  80.614847    -0.018689     60048 2002-05-15 12:59:31.831467
2  19.849396  80.609257    -0.089205     60048 2002-05-15 12:59:32.059467
3  19.830776  80.607857     0.076485     60048 2002-05-15 12:59:32.116467
4  19.570708  80.588183     0.162943     60048 2002-05-15 12:59:32.888467

I would like to group my data into nine-day intervals:

gb = df.groupby(pd.TimeGrouper(key='time', freq='9D'))

The first group:

2002-05-15 12:59:31.717467       lon   lat  h filename                  time
0    19.961216  80.617627    -0.077165     60048 2002-05-15 12:59:31.717467
1    19.923916  80.614847    -0.018689     60048 2002-05-15 12:59:31.831467
2    19.849396  80.609257    -0.089205     60048 2002-05-15 12:59:32.059467
3    19.830776  80.607857     0.076485     60048 2002-05-15 12:59:32.116467
...

Next group:

2002-05-24 12:59:31.717467        lon   lat  h  filename                  time
815   18.309498  80.457024     0.187387     60309 2002-05-24 16:35:39.553563
816   18.291458  80.458514     0.061446     60309 2002-05-24 16:35:39.610563
817   18.273408  80.460014     0.129255     60309 2002-05-24 16:35:39.667563
818   18.255358  80.461504     0.046761     60309 2002-05-24 16:35:39.724563
...

So the data are grouped into nine-day intervals counting from the first timestamp (12:59:31.717467), and not from the beginning of the day as I would like.

When grouping by one day:

gb = df.groupby(pd.TimeGrouper(key='time', freq='D'))

gives me:

2002-05-15 00:00:00       lon   lat  h  filename                  time
0    19.961216  80.617627    -0.077165     60048 2002-05-15 12:59:31.717467
1    19.923916  80.614847    -0.018689     60048 2002-05-15 12:59:31.831467
2    19.849396  80.609257    -0.089205     60048 2002-05-15 12:59:32.059467
3    19.830776  80.607857     0.076485     60048 2002-05-15 12:59:32.116467
...

I could loop over the days until I have a nine-day interval, but I think it could be done in a smarter way. I am looking for a Grouper freq option equivalent to YS (start of year) but for days, or a way of setting the start time (maybe via the Grouper option convention : {'start', 'end', 'e', 's'}), or something similar.

I am running Python 3.5.2 and Pandas version 0.19.0.

Dropping first time row:

Your best bet would be to normalize the first row of the datetime column so that the time is reset to 00:00:00 (midnight) and group according to the 9D interval:

# Reset the first timestamp to midnight so the 9-day bins are anchored there
df.loc[0, 'time'] = df['time'].iloc[0].normalize()
for _, grp in df.groupby(pd.TimeGrouper(key='time', freq='9D')):
    print(grp)

#          lon        lat         h  filename                       time
# 0  19.961216  80.617627 -0.077165     60048 2002-05-15 00:00:00.000000
# 1  19.923916  80.614847 -0.018689     60048 2002-05-15 12:59:31.831467
# 2  19.849396  80.609257 -0.089205     60048 2002-05-15 12:59:32.059467
# 3  19.830776  80.607857  0.076485     60048 2002-05-15 12:59:32.116467
# 4  19.570708  80.588183  0.162943     60048 2002-05-15 12:59:32.888467
# ......................................................................

Only the first row's timestamp is changed. The times in all other rows are kept as they are, so you wouldn't lose that information.

Keeping first time row:

If you want to keep the first row's time unchanged and only want the grouping to start from midnight, you could do:

df_t_shift = df.shift()    # Shift one level down
df_t_shift.loc[0, 'time'] = df_t_shift['time'].iloc[1].normalize()
# Concat last row of df with the shifted one to account for the loss of row
df_t_shift = df_t_shift.append(df.iloc[-1], ignore_index=True)  

for _, grp in df_t_shift.groupby(pd.TimeGrouper(key='time', freq='9D')):
    print (grp)

#          lon        lat         h  filename                       time
# 0        NaN        NaN       NaN       NaN 2002-05-15 00:00:00.000000
# 1  19.961216  80.617627 -0.077165   60048.0 2002-05-15 12:59:31.717467
# 2  19.923916  80.614847 -0.018689   60048.0 2002-05-15 12:59:31.831467
# 3  19.849396  80.609257 -0.089205   60048.0 2002-05-15 12:59:32.059467
# 4  19.830776  80.607857  0.076485   60048.0 2002-05-15 12:59:32.116467
# 5  19.570708  80.588183  0.162943   60048.0 2002-05-15 12:59:32.888467

Completing @mfitzp's answer, you could do this:

df['dateonly'] = df['time'].apply(lambda x: x.date())

The only problem is that df['dateonly'] would then hold plain date objects, which TimeGrouper cannot treat as a DatetimeIndex.

You need to convert it first:

df['dateonly'] = pd.to_datetime(df['dateonly'])

Now you can group on it:

gb = df.groupby(pd.TimeGrouper(key='dateonly', freq='9D'))

For extra information: convention is used with a PeriodIndex, not a DatetimeIndex.

If you truncate the datetimes to midnight of the given day, the grouping will work as expected (starting at the beginning of the day). I expected it to work by converting to dates, e.g.

df['date'] = df['time'].apply(lambda x:x.date())

However, you cannot use TimeGrouper unless the values are datetimes, and date() returns plain dates. You instead have two options. Either truncate the datetimes to midnight directly, as follows:

df['date'] = df['time'].apply(lambda x: x.replace(hour=0, minute=0, second=0, microsecond=0))

Alternatively, you can first generate date values, then convert them back to datetimes, using the pd.to_datetime() function:

df['date'] = df['time'].apply(lambda x: x.date() )
df['date'] = pd.to_datetime(df['date'])
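
Both options can also be written without apply by using the vectorized .dt accessor. This is a minimal sketch on made-up data, assuming the time column is already of datetime64 dtype and timezone-naive.

```python
import pandas as pd

# Illustrative data with the same kind of timestamps as in the question.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2002-05-15 12:59:31.717467",
        "2002-05-15 12:59:32.888467",
        "2002-05-24 16:35:39.553563",
    ]),
})

# Vectorized truncation to midnight; the result stays datetime64,
# so no conversion with pd.to_datetime is needed afterwards.
df["date"] = df["time"].dt.normalize()

# For timezone-naive data, flooring to whole days gives the same values.
same_as_floor = (df["date"] == df["time"].dt.floor("D")).all()

print(df)
print(same_as_floor)
```

The resulting date column can be passed as the grouping key in the same way as the columns built with apply above.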

Link: http://www.djcxy.com/p/37020.html
