Appending Column to Frame of HDF File in Pandas

I am working with a large dataset in CSV format. I am trying to process the data column by column, then append the data to a frame in an HDF file. All of this is done using Pandas. My motivation is that, while the entire dataset is much bigger than my physical memory, the column size is manageable. At a later stage I will be performing feature-wise logistic regression by loading the columns back into memory one by one and operating on them.

I am able to create a new HDF file and a new frame with the first column:

import pandas

hdf_file = pandas.HDFStore('train_data.hdf')
feature_column = pandas.read_csv('data.csv', usecols=[0])
hdf_file.append('features', feature_column)

But after that, I get a ValueError when trying to append a new column to the frame:

feature_column = pandas.read_csv('data.csv', usecols=[1])
hdf_file.append('features', feature_column)

Stack trace and error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 658, in append
    self._write_to_group(key, value, table=True, append=True, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 923, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2985, in write
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 2675, in create_axes
    raise ValueError("cannot match existing table structure for [%s] on appending data" % items)
ValueError: cannot match existing table structure for [srch_id] on appending data

I am new to working with large datasets and limited memory, so I am open to suggestions for alternate ways to work with this data.


The complete docs are here, and some cookbook strategies are here.

PyTables is row-oriented, so you can only append rows, not columns. Read the CSV chunk by chunk, then append each whole chunk (all columns, a subset of rows) as you go, something like this:

import pandas as pd

store = pd.HDFStore('file.h5', mode='w')
for chunk in pd.read_csv('file.csv', chunksize=50000):
    store.append('df', chunk)
store.close()
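
Since you mention loading the columns back one by one later: a table-format store (which append creates) supports selecting a subset of columns on read, so the chunked layout doesn't cost you anything there. A minimal sketch, reusing the file and key from the snippet above and using srch_id (the column named in your error message) as the example:

import pandas as pd

store = pd.HDFStore('file.h5', mode='r')
# Pull just one column of the table back into memory
feature = store.select('df', columns=['srch_id'])
store.close()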

You must be a tad careful, as it is possible for the resultant frame to end up with inconsistent dtypes when read chunk by chunk, e.g. you have an integer-like column that doesn't have missing values until, say, the 2nd chunk. The first chunk would have that column as int64, while the second would have it as float64 (missing values force a float column). You may need to force dtypes with the dtype keyword to read_csv, see here.
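
A minimal sketch of forcing a dtype up front so every chunk agrees; the column name and the choice of float64 here are assumptions for illustration:

import pandas as pd

# Force 'srch_id' to float64 in every chunk, so an all-integer first
# chunk and a later chunk containing missing values get the same dtype
store = pd.HDFStore('file.h5', mode='w')
for chunk in pd.read_csv('file.csv', chunksize=50000,
                         dtype={'srch_id': 'float64'}):
    store.append('df', chunk)
store.close()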

Here is a similar question as well.
