Quickly reading very large tables as dataframes in R
I have very large tables (30 million rows) that I would like to load as data frames in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am assuming I know the types of the columns ahead of time, the table does not contain any column headers or row names, and it does not have any pathological characters that I have to worry about.
I know that reading in a table as a list using scan() can be quite fast, e.g.:
datalist <- scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0))
But some of my attempts to convert this to a dataframe appear to decrease the performance of the above by a factor of 6:
df <- as.data.frame(scan('myfile', sep='\t', list(url='', popularity=0, mintime=0, maxtime=0)))
Is there a better way of doing this? Or quite possibly a completely different approach to the problem?
An update, several years later
This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:
Using fread in data.table for importing data from csv/tab-delimited files directly into R. See mnel's answer.
Using read_table in readr (on CRAN from April 2015). This works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).
read.csv.raw from iotools provides a third option for quickly reading CSV files.
Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse depends section of the DBI package page. MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function. dplyr allows you to work directly with data stored in several types of database.
Storing data in binary formats can also be useful for improving performance. Use saveRDS / readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst / read_fst from the fst package.
The original answer
There are a couple of simple things to try, whether you use read.table or scan.
Set nrows = the number of records in your data (nmax in scan).
Make sure that comment.char="" to turn off interpretation of comments.
Explicitly define the classes of each column using colClasses in read.table.
Setting multi.line=FALSE may also improve performance in scan.
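As a rough sketch of how these settings fit together, the snippet below writes a small tab-separated file without a header and reads it back with both scan and read.table. The file contents, the row count n, and the column names (borrowed from the question) are assumptions for illustration only, and no timings are claimed.

```r
# Build a small headerless, tab-separated file (illustrative data only).
path <- tempfile(fileext = ".tsv")
n <- 1000L
sample_data <- data.frame(
  url = paste0("http://example.com/", seq_len(n)),
  popularity = as.numeric(seq_len(n)),
  mintime = 0,
  maxtime = 1,
  stringsAsFactors = FALSE
)
write.table(sample_data, path, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# scan(): column types come from `what`, nmax caps the number of records,
# and multi.line = FALSE requires each record to sit on one line.
datalist <- scan(path, sep = "\t",
                 what = list(url = "", popularity = 0, mintime = 0, maxtime = 0),
                 nmax = n, multi.line = FALSE, quote = "", comment.char = "",
                 quiet = TRUE)

# read.table(): the same hints expressed as nrows, colClasses and comment.char.
df <- read.table(path, sep = "\t", header = FALSE, quote = "",
                 comment.char = "", nrows = n, stringsAsFactors = FALSE,
                 colClasses = c("character", "numeric", "numeric", "numeric"),
                 col.names = c("url", "popularity", "mintime", "maxtime"))
```

Wrapping either call in system.time() on your own file is probably the simplest way to check whether the hints help in your case.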
If none of these things work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.
The other alternative is filtering your data before you read it into R.
Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with saveRDS; next time you can retrieve it faster with readRDS.
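A minimal sketch of that read-once, cache-as-binary pattern follows. The helper name load_cached and the temporary file paths are hypothetical, and the small generated CSV only stands in for a large input file.

```r
# Hypothetical paths; the CSV is a small stand-in for a large text file.
text_path <- tempfile(fileext = ".csv")
rds_path  <- tempfile(fileext = ".rds")
write.csv(data.frame(id = 1:1000, value = rnorm(1000)), text_path,
          row.names = FALSE)

# Hypothetical helper: parse the text file only when no cached copy exists.
load_cached <- function(text_path, rds_path) {
  if (file.exists(rds_path)) {
    return(readRDS(rds_path))   # fast path: binary cache
  }
  df <- read.csv(text_path, stringsAsFactors = FALSE)
  saveRDS(df, rds_path)         # store the parsed data frame for next time
  df
}

first  <- load_cached(text_path, rds_path)  # parses the CSV, writes the cache
second <- load_cached(text_path, rds_path)  # reads the cache instead
```

This sketch never invalidates the cache, so if the text file can change you would need to delete the .rds file or compare modification times yourself.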
Here is an example that utilizes fread from data.table 1.8.7.
The examples come from the help page for fread, with the timings on my Windows XP Core 2 Duo E8400.
library(data.table)
# Demo speedup
n=1e6
DT = data.table( a=sample(1:1000,n,replace=TRUE),
b=sample(1:1000,n,replace=TRUE),
c=rnorm(n),
d=sample(c("foo","bar","baz","qux","quux"),n,replace=TRUE),
e=rnorm(n),
f=sample(1:1000,n,replace=TRUE) )
DT[2,b:=NA_integer_]
DT[4,c:=NA_real_]
DT[3,d:=NA_character_]
DT[5,d:=""]
DT[2,e:=+Inf]
DT[3,e:=-Inf]
standard read.table
write.table(DT,"test.csv",sep=",",row.names=FALSE,quote=FALSE)
cat("File size (MB):",round(file.info("test.csv")$size/1024^2),"\n")
## File size (MB): 51
system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))
## user system elapsed
## 24.71 0.15 25.42
# second run will be faster
system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))
## user system elapsed
## 17.85 0.07 17.98
optimized read.table
system.time(DF2 <- read.table("test.csv",header=TRUE,sep=",",quote="",
stringsAsFactors=FALSE,comment.char="",nrows=n,
colClasses=c("integer","integer","numeric",
"character","numeric","integer")))
## user system elapsed
## 10.20 0.03 10.32
fread
require(data.table)
system.time(DT <- fread("test.csv"))
## user system elapsed
## 3.12 0.01 3.22
sqldf
require(sqldf)
system.time(SQLDF <- read.csv.sql("test.csv",dbname=NULL))
## user system elapsed
## 12.49 0.09 12.69
# sqldf as on SO
f <- file("test.csv")
system.time(SQLf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
## user system elapsed
## 10.21 0.47 10.73
ff / ffdf
require(ff)
system.time(FFDF <- read.csv.ffdf(file="test.csv",nrows=n))
## user system elapsed
## 10.85 0.10 10.99
In summary:
## user system elapsed Method
## 24.71 0.15 25.42 read.csv (first time)
## 17.85 0.07 17.98 read.csv (second time)
## 10.20 0.03 10.32 Optimized read.table
## 3.12 0.01 3.22 fread
## 12.49 0.09 12.69 sqldf
## 10.21 0.47 10.73 sqldf on SO
## 10.85 0.10 10.99 ffdf
I didn't see this question initially and asked a similar question a few days later. I am going to take my previous question down, but I thought I'd add an answer here to explain how I used sqldf() to do this.
There's been a little bit of discussion as to the best way to import 2GB or more of text data into an R data frame. Yesterday I wrote a blog post about using sqldf() to import the data into SQLite as a staging area, and then sucking it from SQLite into R. This works really well for me. I was able to pull in 2GB (3 columns, 40 million rows) of data in < 5 minutes. By contrast, the read.csv command ran all night and never completed.
Here's my test code:
Set up the test data:
bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
write.csv(bigdf, 'bigdf.csv', quote = F)
I restarted R before running the following import routine:
library(sqldf)
f <- file("bigdf.csv")
system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
I let the following line run all night but it never completed:
system.time(big.df <- read.csv('bigdf.csv'))