Large data with pivot table using Pandas

I'm currently using a Postgres database to store survey answers.

The problem I'm facing is that I need to generate a pivot table from that database.

When the dataset is small, it's easy to read the whole dataset and use Pandas to produce the pivot table.

However, my database now has around 500k rows, and it's growing by around 1,000 rows per day. Reading the whole dataset is no longer efficient.

My question is: do I need to use HDFS to store the data on disk and feed it to Pandas for pivoting?

My customers need to view the pivot table output in near real time. Is there any way to solve this?

My theory is that I'll create the pivot table output for the existing 500k rows and store it somewhere; then, when new data gets saved into the database, I'll only need to merge the new data into the existing pivot table. I'm not sure whether Pandas supports working this way, or whether it needs the full dataset to do the pivoting.
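For what it's worth, this incremental approach does work in Pandas when the pivot aggregation is additive (counts or sums): pivot only the new rows, then add the result cell by cell to the stored pivot table. A minimal sketch, where the frames and column names (`question`, `answer`) are hypothetical stand-ins for the survey data:

```python
import pandas as pd

# Stand-in for the 500k rows already pivoted and stored somewhere.
existing = pd.DataFrame({
    "question": ["q1", "q1", "q2"],
    "answer": ["yes", "no", "yes"],
})
# Stand-in for today's ~1,000 new rows.
new_rows = pd.DataFrame({
    "question": ["q1", "q2"],
    "answer": ["yes", "no"],
})

# Pivot the historical data once and keep the result around.
base = pd.pivot_table(existing, index="question", columns="answer",
                      aggfunc="size", fill_value=0)

# Pivot only the new rows, then add the two tables cell by cell;
# fill_value=0 handles answers that appear in only one of the tables.
delta = pd.pivot_table(new_rows, index="question", columns="answer",
                       aggfunc="size", fill_value=0)
merged = base.add(delta, fill_value=0).astype(int)
```

Note this only holds for additive aggregations; something like a median can't be updated from a stored pivot alone and would still need the raw rows.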


Have you tried using pickle? I'm a data scientist and use it all the time with data sets of 1M+ rows and several hundred columns.

In your particular case I would recommend the following.

import pickle

# 'wb' stands for write bytes; the with-block closes the file for us
with open('path/file.pickle', 'wb') as save_data:
    pickle.dump(pd_data, save_data)

The above code saves your data in a compact binary format that can be loaded back quickly using:

# 'rb' stands for read bytes
with open('path/file.pickle', 'rb') as pickle_data:
    pd_data = pickle.load(pickle_data)

At that point you can append the new 1,000 rows to your data (pd_data) and save it again with pickle. If your data will keep growing and you expect memory to become a problem, I suggest finding a way to append or concatenate the data rather than merge or join, since the latter two can also run into memory issues.
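The append step above can be sketched like this, with hypothetical frames standing in for the stored rows and the daily batch:

```python
import pandas as pd

# Stand-ins for the stored 500k rows and today's new batch.
pd_data = pd.DataFrame({"id": [1, 2], "answer": ["yes", "no"]})
new_rows = pd.DataFrame({"id": [3], "answer": ["yes"]})

# concat simply stacks the new rows under the old ones; merge/join
# build index/hash structures over both frames, which is where the
# extra memory goes.
pd_data = pd.concat([pd_data, new_rows], ignore_index=True)
```

After the concat you'd pickle `pd_data` again as above (pandas also offers `DataFrame.to_pickle` / `pd.read_pickle` as one-line equivalents).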

You will find that this cuts out significant load time when reading from disk (I use Dropbox and it's still lightning fast). To reduce it even further, I usually segment my data sets into groups of rows and columns, then write methods that load the pickled data as needed (super useful for graphing).
