Creating an empty Pandas DataFrame, then filling it?
I'm starting from the pandas DataFrame docs here: http://pandas.pydata.org/pandas-docs/stable/dsintro.html
I'd like to iteratively fill the DataFrame with values in a time-series kind of calculation. So basically, I'd like to initialize a DataFrame with columns A, B and timestamp rows, all 0 or all NaN.
I'd then add initial values and go over this data calculating the new row from the row before, say row[A][t] = row[A][t-1]+1 or so.
I'm currently using the code as below, but I feel it's kind of ugly and there must be a way to do this with a data frame directly or just a better way in general. Note: I'm using Python 2.7.
import datetime as dt
import numpy as np
import pandas as pd

if __name__ == '__main__':
    base = dt.datetime.today().date()
    dates = [base - dt.timedelta(days=x) for x in range(0, 10)]
    dates.sort()

    # one zero-filled Series per symbol, indexed by date
    valdict = {}
    symbols = ['A', 'B', 'C']
    for symb in symbols:
        valdict[symb] = pd.Series(np.zeros(len(dates)), dates)

    # each value is the previous day's value plus one
    for thedate in dates:
        if thedate > dates[0]:
            for symb in valdict:
                valdict[symb][thedate] = 1 + valdict[symb][thedate - dt.timedelta(days=1)]

    print(valdict)
Here's a couple of suggestions:
Use date_range for the index:
import datetime
import pandas as pd
import numpy as np
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')
columns = ['A','B', 'C']
Note: we could create an empty DataFrame (with NaNs) simply by writing:
df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0) # with 0s rather than NaNs
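One thing to watch with this approach: a frame built from only an index and columns typically ends up with object-dtype columns. A minimal sketch, assuming numeric (float) columns are what you want, is to pass a scalar fill value to the constructor instead. The fixed start date is only there to make the example reproducible:

```python
import numpy as np
import pandas as pd

index = pd.date_range("2012-11-29", periods=10, freq="D")
columns = ["A", "B", "C"]

# A scalar is broadcast to every cell, giving float64 columns.
df_nan = pd.DataFrame(np.nan, index=index, columns=columns)
df_zero = pd.DataFrame(0.0, index=index, columns=columns)
```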
To do this type of calculation on the data, use a NumPy array:
data = np.array([np.arange(10)]*3).T
Hence we can create the DataFrame:
In [10]: df = pd.DataFrame(data, index=index, columns=columns)
In [11]: df
Out[11]:
            A  B  C
2012-11-29  0  0  0
2012-11-30  1  1  1
2012-12-01  2  2  2
2012-12-02  3  3  3
2012-12-03  4  4  4
2012-12-04  5  5  5
2012-12-05  6  6  6
2012-12-06  7  7  7
2012-12-07  8  8  8
2012-12-08  9  9  9
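For the recurrence in the question (each row is the previous row plus one), a rough sketch along the same lines is to run the loop over a preallocated NumPy array and only wrap the result in a DataFrame at the end. The fixed start date and the initial value of 0 are assumptions for illustration:

```python
import numpy as np
import pandas as pd

index = pd.date_range("2012-11-29", periods=10, freq="D")
columns = ["A", "B", "C"]

# Preallocate the full result; row 0 holds the initial values (assumed 0 here).
data = np.zeros((len(index), len(columns)))

# Each row is computed from the row before it.
for t in range(1, len(index)):
    data[t] = data[t - 1] + 1

df = pd.DataFrame(data, index=index, columns=columns)
```

Writing into the array avoids repeated label lookups on the DataFrame, and the frame is built only once.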
If you simply want to create an empty data frame and fill it with some incoming data frames later, try this:
In this example I am using this pandas doc to create a new data frame and then using append to write to the newDF with data from oldDF.
Have a look at this
newDF = pd.DataFrame() #creates a new dataframe that's empty
newDF = newDF.append(oldDF, ignore_index = True) # ignoring index is optional
# try printing some data from newDF
print(newDF.head()) # again, optional
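Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On newer versions, a sketch of the same idea with pd.concat could look like this (the contents of oldDF are made up here to stand in for your incoming data):

```python
import pandas as pd

# Hypothetical incoming data, standing in for oldDF above.
oldDF = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

newDF = pd.DataFrame()  # empty frame to start from
newDF = pd.concat([newDF, oldDF], ignore_index=True)  # ignore_index is optional

print(newDF.head())
```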
If you want to have your column names in place from the start, use this approach:
import pandas as pd
col_names = ['A', 'B', 'C']
my_df = pd.DataFrame(columns = col_names)
my_df
If you want to add a record to the DataFrame, it would be better to use:
my_df.loc[len(my_df)] = [2, 4, 5]
However if you want to add another dataframe to my_df do as follows:
col_names = ['A', 'B', 'C']
my_df2 = pd.DataFrame(columns = col_names)
my_df = my_df.append(my_df2)
If you are adding rows inside a loop, consider performance: for roughly the first 1000 records, my_df.loc performs better, but it gradually becomes slower as the number of records in the loop grows.
If you plan to do this inside a big loop (say 10M records or so), you are better off using a mixture of the two: fill a temporary DataFrame with loc until its size gets to around 1000, then append it to the original DataFrame and empty the temporary one. This can boost performance by around 10 times.
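The chunked approach described above might look roughly like this. It is only a sketch: build_in_chunks and its parameters are hypothetical names, the row data is made up, and the finished chunks are collected in a list and joined once with pd.concat (rather than appended one by one) so that it also runs on pandas 2.0+:

```python
import pandas as pd

def build_in_chunks(rows, col_names, chunk_size=1000):
    """Fill a small buffer with .loc and flush it every chunk_size rows."""
    chunks = []                             # finished chunks
    temp = pd.DataFrame(columns=col_names)  # small temporary buffer
    for row in rows:
        temp.loc[len(temp)] = row
        if len(temp) >= chunk_size:
            chunks.append(temp)
            temp = pd.DataFrame(columns=col_names)  # empty the buffer
    if len(temp):
        chunks.append(temp)                 # flush whatever is left over
    if not chunks:
        return pd.DataFrame(columns=col_names)
    return pd.concat(chunks, ignore_index=True)

# Made-up input: 250 rows, flushed every 100 rows to keep the demo quick.
rows = ([i, 2 * i, 3 * i] for i in range(250))
my_df = build_in_chunks(rows, ["A", "B", "C"], chunk_size=100)
```

If all rows are available up front, collecting them in a plain list and calling pd.DataFrame once is usually simpler still.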