Creation of large pandas DataFrames from Series

I'm dealing with data on a fairly large scale. For reference, a given sample will have ~75,000,000 rows and 15,000-20,000 columns.

To conserve memory, my current approach is to build a list of Series (each column is a Series, so ~15K-20K Series, each containing ~250K rows). I then create a SparseDataFrame whose index is the union of every index appearing in these Series (as you can see, the dataset is large but not very dense). The problem is that this becomes extremely slow: appending each column to the DataFrame takes several minutes. To get around that I also tried batching the merges (take a subset of the Series, merge them into a small DataFrame, then merge that into my main DataFrame), but this is still too slow. By slow I mean it processed only ~4000 columns in a day, with each append also making every subsequent append take longer.

One thing that strikes me as odd is why the column count of the main DataFrame affects the append speed at all. Since my main index already contains every entry it will ever see, I shouldn't be losing time to re-indexing.
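To make the re-indexing concern concrete, here is a toy illustration (made-up names and sizes, not my real data) of the alignment that happens on every column assignment:

# Toy illustration only: assigning a Series as a new column aligns it to the
# frame's index each time, and rows the Series doesn't cover become NaN.
import numpy as np
import pandas as pd

full_index = pd.Index(np.arange(1000000))    # stand-in for the union of all row labels
frame = pd.DataFrame(index=full_index)
col = pd.Series(np.ones(250000, dtype='uint16'), index=full_index[:250000])
frame['rt_0'] = col                          # aligned/reindexed to full_index on assignment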

In any case, here is my code:

import time
import sys
import numpy as np
import pandas as pd
precision = 6
df = []
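# build one Series per scan: index = m/z rounded to 6 decimals, values = intensity,
# name = retention time (i.rt); 'raw' is my input (each item has .scans and .rt)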
for index, i in enumerate(raw):
    if i is None:
        break
    if index%1000 == 0:
        sys.stderr.write('Processed %s...\n' % index)
    df.append(pd.Series(dict([(np.round(mz, precision),int(intensity)) for mz, intensity in i.scans]), dtype='uint16', name=i.rt))

all_indices = set([])
for j in df:
    all_indices |= set(j.index.tolist())

print len(all_indices)
t = time.time()
main_df = pd.DataFrame(index=all_indices)
first = True
del all_indices
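# work through the Series in batches of 10: build a small frame per batch,
# then copy its columns into main_df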
while df:
    subset = [df.pop() for i in xrange(10) if df]
    all_indices = set([])
    for j in subset:
        all_indices |= set(j.index.tolist())
    df2 = pd.DataFrame(index=all_indices)
    df2.sort_index(inplace=True, axis=0)
    df2.sort_index(inplace=True, axis=1)
    del all_indices
    ind=0
    while subset:
        t2 = time.time()
        ind+=1
        arr = subset.pop()
        df2[arr.name] = arr
        print ind,time.time()-t,time.time()-t2
    df2 = df2.reindex(main_df.index)  # reindex is not in-place; keep the result
    t2 = time.time()
    for i in df2.columns:
        main_df[i] = df2[i]
    if first:
        main_df = main_df.to_sparse()
        first = False
    print 'join time', time.time()-t,time.time()-t2
    print len(df), 'entries remain'

Any advice on how to load this large dataset quickly is appreciated, even if it means first writing it out to disk in some other format.

Some additional info:

1) Because of the number of columns, I can't use most traditional on-disk stores such as HDF.

2) When the data is in use, it will be queried across both rows and columns, i.e. main_df.loc[row:row_end, col:col_end] (see the toy example after this list). The block sizes aren't predictable, so chunking isn't really an option. These lookups also need to be fast, on the order of ~10 per second, to be realistically useful.

3) I have 32 GB of memory, so I think a SparseDataFrame is the best option: it fits in memory and allows fast lookups as needed. It's just that creating it is a pain at the moment.
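As a toy example of the block lookups I mean in (2), with made-up sizes and label values:

# Toy version of the block lookups described in (2): label-based slices on both
# axes of a frame whose row index is sorted m/z values and whose columns are
# sorted retention times. Sizes and values here are made up.
import numpy as np
import pandas as pd

mz = np.round(np.sort(np.random.uniform(100, 2000, 5000)), 6)   # row labels (m/z)
rt = np.round(np.sort(np.random.uniform(0, 60, 200)), 2)        # column labels (retention time)
toy = pd.DataFrame(0, index=mz, columns=rt, dtype='uint16')

block = toy.loc[400.0:450.0, 10.0:20.0]   # the row:row_end, col:col_end pattern from (2)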

Update:

For the time being I ended up using scipy sparse matrices and handling the indexing on my own. This gives appends at a constant ~0.2 seconds each, which is acceptable (versus pandas taking ~150 seconds per append on my full dataset). I'd love to know how to make pandas match this speed.
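Roughly, this is what I mean by handling the indexing on my own. A minimal sketch (illustrative names such as row_pos and col_rts; it assumes the list of Series df and the union index all_indices from the code above are still available):

# Illustrative sketch only, not my exact code: map each rounded m/z to a fixed row
# position once, build the matrix from (row, col, value) triplets, then convert to
# CSC so block slicing is fast.
from scipy import sparse

row_pos = dict((mz, i) for i, mz in enumerate(sorted(all_indices)))  # m/z -> row position
col_rts = [s.name for s in df]                                       # retention times, one per column

rows, cols, vals = [], [], []
for j, s in enumerate(df):                       # df is the list of Series built above
    rows.extend(row_pos[mz] for mz in s.index)
    cols.extend([j] * len(s))
    vals.extend(s.values)

mat = sparse.coo_matrix((vals, (rows, cols)),
                        shape=(len(row_pos), len(col_rts)),
                        dtype='uint16').tocsc()
# block lookups become mat[r0:r1, c0:c1] after translating m/z and rt labels to positions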
