Grouping Pandas DataFrame by n days starting in the begining of the day
I have just discovered the power of Pandas and I love it, but I can't figure out this problem:
I have a DataFrame df.head()
:
lon lat h filename time
0 19.961216 80.617627 -0.077165 60048 2002-05-15 12:59:31.717467
1 19.923916 80.614847 -0.018689 60048 2002-05-15 12:59:31.831467
2 19.849396 80.609257 -0.089205 60048 2002-05-15 12:59:32.059467
3 19.830776 80.607857 0.076485 60048 2002-05-15 12:59:32.116467
4 19.570708 80.588183 0.162943 60048 2002-05-15 12:59:32.888467
I would like to group my data into nine day intervals
gb = df.groupby(pd.TimeGrouper(key='time', freq='9D'))
The first group:
2002-05-15 12:59:31.717467 lon lat h filename time
0 19.961216 80.617627 -0.077165 60048 2002-05-15 12:59:31.717467
1 19.923916 80.614847 -0.018689 60048 2002-05-15 12:59:31.831467
2 19.849396 80.609257 -0.089205 60048 2002-05-15 12:59:32.059467
3 19.830776 80.607857 0.076485 60048 2002-05-15 12:59:32.116467
...
Next group:
2002-05-24 12:59:31.717467 lon lat height filename time
815 18.309498 80.457024 0.187387 60309 2002-05-24 16:35:39.553563
816 18.291458 80.458514 0.061446 60309 2002-05-24 16:35:39.610563
817 18.273408 80.460014 0.129255 60309 2002-05-24 16:35:39.667563
818 18.255358 80.461504 0.046761 60309 2002-05-24 16:35:39.724563
...
So the data are grouped in nine days counting from the first time ( 12:59:31.717467), and not from the beginning of the day as I would like.
When grouping by one day:
gb = df.groupby(pd.TimeGrouper(key='time', freq='D'))
gives me:
2002-05-15 00:00:00 lon lat h filename time
0 19.961216 80.617627 -0.077165 60048 2002-05-15 12:59:31.717467
1 19.923916 80.614847 -0.018689 60048 2002-05-15 12:59:31.831467
2 19.849396 80.609257 -0.089205 60048 2002-05-15 12:59:32.059467
3 19.830776 80.607857 0.076485 60048 2002-05-15 12:59:32.116467
...
I can just loop over the days until I get a nine day interval, but I think it could be done smarter, I am looking for a Grouper freq
option equivalent to YS (start of year) just for days, a way of setting the start time (maybe by the Grouper option convention : {'start', 'end', 'e', 's'}
), or???
I am running Python 3.5.2 and Pandas is in version: 0.19.0
Dropping first time row:
Your best bet would be to normalize
the first row of the datetime
column so that the time is reset to 00:00:00
(midnight) and group according to the 9D interval:
df.loc[0, 'time'] = df['time'].iloc[0].normalize()
for _, grp in df.groupby(pd.TimeGrouper(key='time', freq='9D')):
print (grp)
# lon lat h filename time
# 0 19.961216 80.617627 -0.077165 60048 2002-05-15 00:00:00.000000
# 1 19.923916 80.614847 -0.018689 60048 2002-05-15 12:59:31.831467
# 2 19.849396 80.609257 -0.089205 60048 2002-05-15 12:59:32.059467
# 3 19.830776 80.607857 0.076485 60048 2002-05-15 12:59:32.116467
# 4 19.570708 80.588183 0.162943 60048 2002-05-15 12:59:32.888467
# ......................................................................
This restores the time in the other rows and so you wouldn't lose that information.
Keeping first time row:
If you want to keep the first time row as it is and not make any changes to it, but only want to start grouping from midnight onwards, you could do:
df_t_shift = df.shift() # Shift one level down
df_t_shift.loc[0, 'time'] = df_t_shift['time'].iloc[1].normalize()
# Concat last row of df with the shifted one to account for the loss of row
df_t_shift = df_t_shift.append(df.iloc[-1], ignore_index=True)
for _, grp in df_t_shift.groupby(pd.TimeGrouper(key='time', freq='9D')):
print (grp)
# lon lat h filename time
# 0 NaN NaN NaN NaN 2002-05-15 00:00:00.000000
# 1 19.961216 80.617627 -0.077165 60048.0 2002-05-15 12:59:31.717467
# 2 19.923916 80.614847 -0.018689 60048.0 2002-05-15 12:59:31.831467
# 3 19.849396 80.609257 -0.089205 60048.0 2002-05-15 12:59:32.059467
# 4 19.830776 80.607857 0.076485 60048.0 2002-05-15 12:59:32.116467
# 5 19.570708 80.588183 0.162943 60048.0 2002-05-15 12:59:32.888467
completing @mfitzp answer you could do this:
df['dateonly'] = df['time'].apply(lambda x: x.date())
only problem with that is df['dateonly']
would not be a DatetimeIndex
you need to convert it first:
df['dateonly'] = pd.to_datetime(df['dateonly'])
now you can group on it
gb = df.groupby(pd.TimeGrouper(key='dateonly', freq='9D'))
and for extra information convention
is used with PeriodIndex
not DatetimeIndex
If you truncate the datetimes to midnight of the given day, the grouping will work as expected (starting at the beginning of the day). I expected it to work by converting to datetimes, eg
df['date'] = df['time'].apply(lambda x:x.date())
However, you can not use TimeGrouper
unless the index is a datetime
. You instead have two options, either truncate the datetimes to midnight directly as follows:
df['date'] = df['time'].apply(lambda x:x.replace(hour=0, minute=0, second=0, microsecond=0)))
Alternatively, you can first generate date
values, then convert them back to datetimes, using the pd.to_datetime()
function:
df['date'] = df['time'].apply(lambda x: x.date() )
df['date'] = pd.to_datetime(df['date'])
链接地址: http://www.djcxy.com/p/37020.html