Subset across multiple columns in R

2018-06-26 10:02:18

I am a student taking R. I have a directory with lots of files in it. I need to write a function named 'pollutantmean' that calculates the mean of a pollutant (either sulfate or nitrate) from a data set (see the example below). The function takes three arguments: 'directory', 'pollutant', and 'id'.

As part of my function, I have successfully read all the files and combined them into a single data frame, using rbind and a for loop, so that I can now do calculations such as median and mean.

The problem is that, after creating the data frame, I need a way to subset my data by one or more of its columns, either column 2 or column 3 (sulfate or nitrate).

I am given a prototype of the function as follows:

pollutantmean <- function(directory, pollutant, id = 1:332) {
  ## 'directory' is a character vector of length 1 indicating
  ## the location of the CSV files

  ## 'pollutant' is a character vector of length 1 indicating
  ## the name of the pollutant for which we will calculate the
  ## mean; either "sulfate" or "nitrate".

  ## 'id' is an integer vector indicating the monitor ID numbers
  ## to be used

  ## Return the mean of the pollutant across all monitors list
  ## in the 'id' vector (ignoring NA values)
}

Here is an example of the output of this function:

pollutantmean("specdata", "sulfate", 1:10)
## [1] 4.064
pollutantmean("specdata", "nitrate", 70:72)
## [1] 1.706
pollutantmean("specdata", "nitrate", 23)
## [1] 1.281

Here is what I have as a first experiment, using just a single ID and a single pollutant type (sulfate):

pollutantmean <- function(directory, pollutant, ID = 1:332) {
         data <- read.csv("specdata/001.csv")
         subset(data, data$ID == 1)
         mean(data$sulfate, na.rm = TRUE)
}
pollutantmean("specdata", "sulfate", 1)
[1] 3.880701

What I cannot figure out is how to calculate the mean for whichever pollutant is requested, 'sulfate' or 'nitrate', rather than hard-coding one of them.

Can anyone provide some advice regarding my next steps?

Here is an example of my data:

"Date","sulfate","nitrate","ID"
"2003-01-01",NA,NA,1
"2003-01-02",NA,NA,1
"2003-01-03",NA,NA,1
"2003-01-04",NA,NA,1
"2003-01-05",NA,NA,1

I think the following will help you. It also takes care of the subsetting:

mean(data[data$ID %in% id, pollutant], na.rm = TRUE)
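To put that one-liner in context, here is a minimal sketch of the whole function. It assumes the files are named by zero-padded monitor ID (001.csv, 002.csv, ...), as the path "specdata/001.csv" in the question suggests. The temporary directory and its two small files are made up purely for demonstration, and reading with lapply is one choice among several, not the only way to do it.

```r
# Sketch only: assumes each monitor's data lives in <directory>/<ID>.csv,
# with the ID zero-padded to three digits (e.g. 001.csv).
pollutantmean <- function(directory, pollutant, id = 1:332) {
  # Build the file paths for the requested monitors
  files <- file.path(directory, sprintf("%03d.csv", id))
  # Read each file and stack the rows into one data frame
  data <- do.call(rbind, lapply(files, read.csv))
  # 'pollutant' holds a column name as a string, so it is passed as the
  # column index; data$pollutant would instead look for a column that is
  # literally named "pollutant"
  mean(data[data$ID %in% id, pollutant], na.rm = TRUE)
}

# Made-up demo data in a temporary directory (not the real specdata files)
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Date = c("2003-01-01", "2003-01-02"),
                     sulfate = c(2, NA), nitrate = c(1, 3), ID = 1L),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(Date = c("2003-01-01", "2003-01-02"),
                     sulfate = c(4, 6), nitrate = c(NA, 5), ID = 2L),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

sulfate_mean <- pollutantmean(demo_dir, "sulfate", 1:2)  # mean of 2, 4, 6
nitrate_mean <- pollutantmean(demo_dir, "nitrate", 2)    # mean of 5
print(sulfate_mean)
print(nitrate_mean)
```

Because only the requested files are read here, the data$ID %in% id filter is redundant in this sketch; it matters when every file has already been combined into one data frame, as described in the question.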

The following approach may help:

ddf = structure(list(Date = structure(1:5, .Label = c("2003-01-01", 
"2003-01-02", "2003-01-03", "2003-01-04", "2003-01-05"), class = "factor"), 
    sulfate = c(50L, 75L, 85L, 45L, 25L), nitrate = c(854L, 658L, 
    485L, 458L, 152L), ID = c(1L, 1L, 2L, 1L, 2L)), .Names = c("Date", 
"sulfate", "nitrate", "ID"), class = "data.frame", row.names = c(NA, 
-5L))

ddf
        Date sulfate nitrate ID
1 2003-01-01      50     854  1
2 2003-01-02      75     658  1
3 2003-01-03      85     485  2
4 2003-01-04      45     458  1
5 2003-01-05      25     152  2

library(reshape2)  # provides melt()
ddfm = melt(ddf[,2:4], id="ID")
ddfm
   ID variable value
1   1  sulfate    50
2   1  sulfate    75
3   2  sulfate    85
4   1  sulfate    45
5   2  sulfate    25
6   1  nitrate   854
7   1  nitrate   658
8   2  nitrate   485
9   1  nitrate   458
10  2  nitrate   152


with(ddfm, tapply(value, list(variable, ID), mean))

                1     2
sulfate  56.66667  55.0
nitrate 656.66667 318.5
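If you would rather not depend on an add-on package for melt(), the same per-ID means can be sketched with base R's aggregate(). The variable names means, pollutant, and id below are chosen only for illustration. Note that the formula interface of aggregate() drops any row containing an NA in either pollutant column by default, so on real data with missing values the results may differ from computing each pollutant's mean separately with na.rm = TRUE.

```r
# Same toy data as ddf above, rebuilt so this block runs on its own
ddf <- data.frame(
  Date = c("2003-01-01", "2003-01-02", "2003-01-03",
           "2003-01-04", "2003-01-05"),
  sulfate = c(50L, 75L, 85L, 45L, 25L),
  nitrate = c(854L, 658L, 485L, 458L, 152L),
  ID = c(1L, 1L, 2L, 1L, 2L)
)

# Mean of both pollutant columns for each ID, using base R only
means <- aggregate(cbind(sulfate, nitrate) ~ ID, data = ddf, FUN = mean)
print(means)

# Pick one pollutant for a chosen set of IDs (illustrative values)
pollutant <- "nitrate"
id <- 2
selected <- means[means$ID %in% id, pollutant]
print(selected)
```

This gives one row per ID rather than one column per ID, which can be easier to filter by monitor afterwards.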
