tidyverse method for reading a section of a CSV

2018-06-10 11:15:37

Scenario: You have a CSV file with data in sections, e.g.:

[Car data]

mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb

21,6,160,110,3.9,2.62,16.46,0,1,4,4

21,6,160,110,3.9,2.875,17.02,0,1,4,4

22.8,4,108,93,3.85,2.32,18.61,1,1,4,1

21.4,6,258,110,3.08,3.215,19.44,1,0,3,1

18.7,8,360,175,3.15,3.44,17.02,0,0,3,2

18.1,6,225,105,2.76,3.46,20.22,1,0,3,1

14.3,8,360,245,3.21,3.57,15.84,0,0,3,4 ...

[Other stuff]

Forgive the formatting. I had to add extra new lines to get the block quoting to at least resemble the intended data format. I'll create a reproducible example using mtcars below and pretend we've done the easy bit of subsetting the rows we want, for example as per the motivating code quoted here:

# Import raw data:
data_raw <- readLines("test.txt")

# find separation line:
id_sep <- which(data_raw=="")

# create ranges of both data sets:
data_1_range <- 4:(id_sep-1)
data_2_range <- (id_sep+4):length(data_raw)

# using ranges and row data import it:
data_1 <- read.csv(textConnection(data_raw[data_1_range]))
data_2 <- read.csv(textConnection(data_raw[data_2_range]))

from this post. In other words, the approach we're looking at adopting is to read the data in once, as lines, find the lines we want, and then "read" them using read.csv to get a data.frame.
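As a rough illustration of that read-once, find, then parse idea, here is a minimal base R sketch that locates a section by its "[...]" marker rather than by hard-coded offsets. The file contents and the helper name section_lines are made up for this example, and it assumes every section starts with a marker line and contains no blank lines.

```r
# Build a small sectioned file to experiment on (contents are invented).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "[Car data]",
  "mpg,cyl,disp",
  "21,6,160",
  "22.8,4,108",
  "[Other stuff]",
  "a,b",
  "1,2"
), path)

# Read the whole file once, as lines.
data_raw <- readLines(path)

# Lines that look like "[Section name]" mark where each section starts.
header_idx <- grep("^\\[.*\\]$", data_raw)

# Hypothetical helper: return the lines between one marker and the next.
section_lines <- function(lines, headers, name) {
  start <- headers[lines[headers] == paste0("[", name, "]")]
  stopifnot(length(start) == 1)
  later <- headers[headers > start]
  end <- if (length(later) > 0) later[1] - 1 else length(lines)
  lines[(start + 1):end]
}

car_lines <- section_lines(data_raw, header_idx, "Car data")

# read.csv() accepts the lines directly through its text argument.
car_df <- read.csv(text = car_lines)
```

Passing the lines via text = is equivalent in spirit to the textConnection() call in the quoted code, just with less ceremony.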

Okay, so the year is now 2017 and we want to embrace the tidyverse world and use read_lines in place of readLines, and read_csv in place of read.csv.

library(tidyverse)

write_csv(mtcars, "mtcars_local.csv")
# this creates an easily reproduced local file

data_raw <- readLines("mtcars_local.csv")
# henceforth assume we've found the desired rows and subsetted

data_df <- read.csv(textConnection(data_raw))

head(data_df)
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# whoo hoo, the above is exactly the output we want (replicating
# the original post answer)

data_raw_2 <- read_lines("mtcars_local.csv")

data_df_2 <- read_csv(textConnection(data_raw_2))
#Error in read_connection_(con) : 
#  Evaluation error: can only read from a binary connection.

So read_csv doesn't like taking a textConnection like read.csv did. The documentation for read_csv does say:

Arguments:

file: Either a path to a file, a connection, or literal data
      (either a single string or a raw vector).

So, question(s):

Is there a neat tidyverse way of getting a particular delimited section of a CSV into a tibble? (that doesn't involve reading in the lines and subsetting as an interim step)

Or from such a vector of strings of each line, how can you get them into a tibble?
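On the first question, one possible sketch uses the skip and n_max arguments of read_csv so that the section is parsed straight from the file. It still calls read_lines once to find the marker positions, so it only avoids the subsetting step, not the reading step. The file below is invented for illustration and is assumed to contain no blank lines, since blank lines may change how skipped lines are counted.

```r
library(readr)

# Made-up sectioned file, without blank lines, purely for illustration.
path <- tempfile(fileext = ".txt")
write_lines(c(
  "[Car data]",
  "mpg,cyl,disp",
  "21,6,160",
  "22.8,4,108",
  "[Other stuff]",
  "a,b",
  "1,2"
), path)

# Positions of the "[...]" section markers.
markers <- grep("^\\[.*\\]$", read_lines(path))
start <- markers[1]  # line number of "[Car data]"
end <- markers[2]    # line number of "[Other stuff]"

# Skip everything up to and including the first marker, then read only
# the data rows of that section (the column header row is not counted
# by n_max).
car_tbl <- read_csv(path, skip = start, n_max = end - start - 2)
```

The arithmetic for n_max is the number of lines strictly between the two markers, minus one for the column header row.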

We can create a single string of data with rows separated by the required newline:

paste0(data_raw, collapse = "\n")
[1] "mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb\n21,6,160,110,3.9,2.62,16.46,0,1,4,4\n21,6,160,110,...

data_df_2 <- read_csv(paste0(data_raw, collapse = "\n"))

head(data_df_2)
# A tibble: 6 x 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
1  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4
2  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4
3  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1
4  21.4     6   258   110  3.08 3.215 19.44     1     0     3     1
5  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2
6  18.1     6   225   105  2.76 3.460 20.22     1     0     3     1

Okay, et voilà. In writing this post I've come up with an answer, but the use of paste seems clunky. Maybe I've been spoilt by reading about the glue package. Is there a "tidy"er way of getting a section of data from a CSV into a tibble?
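Two candidate alternatives, offered as a sketch rather than a definitive answer: stringr::str_flatten() does the same collapsing as paste0 with a tidyverse-style name, and, assuming readr 2.0.0 or later, wrapping the character vector in I() should make read_csv treat it as literal data with one element per line. The three-column data_raw below is a stand-in for the lines already subsetted from the larger file.

```r
library(readr)
library(stringr)

# Stand-in for the lines already subsetted from the larger file.
data_raw <- c("mpg,cyl,disp", "21,6,160", "22.8,4,108")

# Option 1: collapse to a single newline-separated string with stringr.
df_a <- read_csv(str_flatten(data_raw, collapse = "\n"))

# Option 2 (assumes readr >= 2.0.0): mark the vector as literal data
# with I(), so no manual collapsing is needed.
df_b <- read_csv(I(data_raw))
```

Option 2 is the shorter of the two, but it depends on the installed readr version, whereas the collapsed string also works with the older releases that produced the textConnection error.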
