tidyverse method for reading CSV section
Scenario: You have a CSV file with data in sections, e.g.
[Car data]
mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
21,6,160,110,3.9,2.62,16.46,0,1,4,4
21,6,160,110,3.9,2.875,17.02,0,1,4,4
22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
18.1,6,225,105,2.76,3.46,20.22,1,0,3,1
14.3,8,360,245,3.21,3.57,15.84,0,0,3,4 ...
[Other stuff]
Forgive the formatting. I had to add extra new lines to get the block quoting to at least resemble the intended data format. I'll create a reproducible example using mtcars below and pretend we've done the easy bit of subsetting the rows we want, for example as per the motivating code quoted here:
# Import raw data:
data_raw <- readLines("test.txt")
# find separation line:
id_sep <- which(data_raw=="")
# create ranges of both data sets:
data_1_range <- 4:(id_sep-1)
data_2_range <- (id_sep+4):length(data_raw)
# using ranges and row data import it:
data_1 <- read.csv(textConnection(data_raw[data_1_range]))
data_2 <- read.csv(textConnection(data_raw[data_2_range]))
from this post. In other words, the approach we're looking at adopting is to read the data in once, as lines, find the lines we want, and then "read" them using read.csv to get a data.frame.
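For the bracketed-section layout shown at the top, that approach might be sketched roughly as follows. The sample lines, the assumption that every section header looks like "[name]", and the helper read_section are all made up for illustration; the quoted code above finds its ranges by a blank separator line instead.

```r
# Hypothetical sample mimicking the sectioned file; not the real data.
data_raw <- c(
  "[Car data]",
  "mpg,cyl,disp",
  "21,6,160",
  "22.8,4,108",
  "[Other stuff]",
  "a,b",
  "1,2"
)

# Assumption: section headers are whole lines of the form "[name]".
header_idx <- grep("^\\[.*\\]$", data_raw)
section_names <- gsub("^\\[|\\]$", "", data_raw[header_idx])

# Each section runs from the line after its header
# to the line before the next header (or the end of the file).
starts <- header_idx + 1
ends <- c(header_idx[-1] - 1, length(data_raw))

# Hypothetical helper: parse one named section into a data.frame.
read_section <- function(name) {
  i <- match(name, section_names)
  read.csv(text = data_raw[starts[i]:ends[i]])
}

car_data <- read_section("Car data")
```

Passing the lines through the text argument of read.csv is equivalent to wrapping them in textConnection yourself.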
Okay, so the year is now 2017 and we want to embrace the tidyverse world and use read_lines in place of readLines, and read_csv in place of read.csv.
library(tidyverse)
write_csv(mtcars, "mtcars_local.csv")
# this creates an easily reproduced local file
data_raw <- readLines("mtcars_local.csv")
# henceforth assume we've found the desired rows and subsetted
data_df <- read.csv(textConnection(data_raw))
head(data_df)
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# whoo hoo, the above is exactly the output we want (replicating
# the original post answer)
data_raw_2 <- read_lines("mtcars_local.csv")
data_df_2 <- read_csv(textConnection(data_raw_2))
#Error in read_connection_(con) :
# Evaluation error: can only read from a binary connection.
So read_csv doesn't accept a textConnection the way read.csv does. The documentation for read_csv does say:
Arguments:
file: Either a path to a file, a connection, or literal data
(either a single string or a raw vector).
So, question(s):
We can create a single string of data with rows separated by the required newline:
paste0(data_raw, collapse = "\n")
[1] "mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb\n21,6,160,110,3.9,2.62,16.46,0,1,4,4\n21,6,160,110,...
data_df_2 <- read_csv(paste0(data_raw, collapse = "\n"))
head(data_df_2)
# A tibble: 6 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Okay, et voilà. In writing this post I've come up with an answer. But the use of paste seems clunky. Maybe I've been spoilt by reading about the glue package. But is there a "tidy"er way of getting a section of data from a CSV into a tibble?
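Two possibilities that avoid the explicit paste, offered as a sketch rather than a definitive answer. The first assumes readr 2.0.0 or later, where wrapping a character vector in I() marks it as literal data; the second uses the skip and n_max arguments of read_csv and assumes you already know where the section starts and how many rows it has. The sample lines and the temporary file are hypothetical.

```r
library(readr)

# Hypothetical lines, standing in for a section already
# subsetted from the output of read_lines().
section_lines <- c(
  "mpg,cyl,disp",
  "21,6,160",
  "22.8,4,108"
)

# I() marks the vector as literal data rather than file paths
# (assumption: readr >= 2.0.0; older versions need the paste approach).
data_df_3 <- read_csv(I(section_lines), show_col_types = FALSE)

# Alternative: let read_csv() pick the rows straight from the file.
# skip drops the "[Car data]" line; n_max caps the number of data rows.
tmp <- tempfile(fileext = ".csv")
writeLines(c("[Car data]", section_lines, "[Other stuff]"), tmp)
data_df_4 <- read_csv(tmp, skip = 1, n_max = 2, show_col_types = FALSE)
```

The skip and n_max route reads the file only once and never holds the raw lines in R, but it needs the row positions up front, so in practice it would likely be paired with a first pass over read_lines() to locate the section headers.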