How to estimate how much memory a Pandas' DataFrame will need?

I have been wondering... If I am reading, say, a 400MB csv file into a pandas dataframe (using read_csv or read_table), is there any way to guesstimate how much memory this will need? Just trying to get a better feel for data frames and memory...


df.memory_usage() will return how many bytes each column occupies:

>>> df.memory_usage()

Row_ID            20906600
Household_ID      20906600
Vehicle           20906600
Calendar_Year     20906600
Model_Year        20906600
...

To include the index, pass index=True.

So to get overall memory consumption:

>>> df.memory_usage(index=True).sum()
731731000

Also, passing deep=True will enable a more accurate memory usage report that accounts for the full usage of the contained objects.

This matters because, with the default deep=False, the report does not include memory consumed by elements that are not components of the array itself, such as the Python objects referenced by an object-dtype column.
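For example, on an object (string) column the shallow report only counts the 8-byte object pointers, while deep=True also counts the string objects themselves. A minimal sketch (the column name and data below are made up for illustration):

import pandas as pd

df = pd.DataFrame({"s": ["some fairly long string"] * 100000})

# Shallow (default): counts only the 8-byte object pointers in the column.
print(df.memory_usage(index=True).sum())

# Deep: also counts the memory of the Python string objects themselves.
print(df.memory_usage(index=True, deep=True).sum())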


You have to do this in reverse. The snippets below assume from pandas import DataFrame and from numpy.random import randn.

In [4]: DataFrame(randn(1000000,20)).to_csv('test.csv')

In [5]: !ls -ltr test.csv
-rw-rw-r-- 1 users 399508276 Aug  6 16:55 test.csv

Technically, the in-memory size is about this (which includes the indexes):

In [16]: df.values.nbytes + df.index.nbytes + df.columns.nbytes
Out[16]: 168000160

So about 168MB in memory versus a 400MB file, for 1M rows of 20 float columns.

DataFrame(randn(1000000,20)).to_hdf('test.h5','df')

!ls -ltr test.h5
-rw-rw-r-- 1 users 168073944 Aug  6 16:57 test.h5

MUCH more compact when written as a binary HDF5 file

In [12]: DataFrame(randn(1000000,20)).to_hdf('test.h5','df',complevel=9,complib='blosc')

In [13]: !ls -ltr test.h5
-rw-rw-r-- 1 users 154727012 Aug  6 16:58 test.h5

The data was random, so compression doesn't help much.
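To see how much the content matters, here is a minimal sketch writing highly compressible (all-zero) data the same way, assuming PyTables is installed; the file name is arbitrary:

import os
import numpy as np
from pandas import DataFrame

# All-zero data compresses extremely well, unlike the random data above.
DataFrame(np.zeros((1000000, 20))).to_hdf('test_zeros.h5', key='df',
                                          complevel=9, complib='blosc')
print(os.path.getsize('test_zeros.h5'))  # far smaller than the random case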


I thought I would bring some more data to the discussion.

I ran a series of tests on this issue.

Using the Python resource package, I measured the memory usage of my process.

And by writing the csv into a StringIO buffer, I could easily measure its size in bytes.
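A minimal sketch of that measurement, assuming a Unix system (the resource module is Unix-only); the frame shape is arbitrary:

import resource
from io import StringIO

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100000, 10))

# Size of the csv representation, measured without touching the disk.
buf = StringIO()
df.to_csv(buf)
csv_bytes = len(buf.getvalue())  # csv output is ASCII, so chars == bytes

# Peak memory of the whole process: kilobytes on Linux, bytes on macOS.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print(csv_bytes, peak)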

I ran two experiments, each creating 20 dataframes of increasing size, from 10,000 to 1,000,000 rows, each with 10 columns.

In the first experiment I used only floats in my dataset.

This is how the memory usage grew in comparison to the csv file size, as a function of the number of rows (sizes in megabytes):

[Figure: memory and csv size (in megabytes) as a function of the number of rows, float entries]

In the second experiment I took the same approach, but the dataset consisted only of short strings.

[Figure: memory and csv size (in megabytes) as a function of the number of rows, string entries]

It seems that the ratio between the csv size and the dataframe size can vary quite a lot, but the size in memory will always be bigger by a factor of 2-3 (for the frame sizes in this experiment).

I would love to complete this answer with more experiments; please comment if you want me to try something special.
