Subset across multiple columns in R
I am a student learning R. I have a directory with lots of files in it. I need to write a function named 'pollutantmean' that calculates the mean of a pollutant (either sulfate or nitrate) from a data set (see the example below). The function takes three arguments: 'directory', 'pollutant', and 'id'.
As part of my function, I have successfully read all the files and combined them into a single data frame, using rbind and a for loop, so that I can now do calculations such as the median and the mean.
The problem is that, after creating the data frame, I need a way to subset my data by one or more of its columns, namely column 2 or column 3.
I am given a prototype of the function as follows:
pollutantmean <- function(directory, pollutant, id = 1:332) {
  ## 'directory' is a character vector of length 1 indicating
  ## the location of the CSV files

  ## 'pollutant' is a character vector of length 1 indicating
  ## the name of the pollutant for which we will calculate the
  ## mean; either "sulfate" or "nitrate".

  ## 'id' is an integer vector indicating the monitor ID numbers
  ## to be used

  ## Return the mean of the pollutant across all monitors listed
  ## in the 'id' vector (ignoring NA values)
}
Here is an example of the output of this function:
pollutantmean("specdata", "sulfate", 1:10)
## [1] 4.064
pollutantmean("specdata", "nitrate", 70:72)
## [1] 1.706
pollutantmean("specdata", "nitrate", 23)
## [1] 1.281
Here is what I have as a first experiment, using just a single ID and a single pollutant type (sulfate):
pollutantmean <- function(directory, pollutant, ID = 1:332) {
data <- read.csv("specdata/001.csv")
subset(data, data$ID == 1)
mean(data$sulfate, na.rm = TRUE)
}
pollutantmean("specdata", "sulfate", 1)
[1] 3.880701
What I cannot figure out is how to calculate the mean for whichever pollutant type is requested, 'sulfate' or 'nitrate'.
Can anyone provide some advice regarding my next steps?
Here is an example of my data:
"Date","sulfate","nitrate","ID"
"2003-01-01",NA,NA,1
"2003-01-02",NA,NA,1
"2003-01-03",NA,NA,1
"2003-01-04",NA,NA,1
"2003-01-05",NA,NA,1
I think the following will help you. It also takes care of the subsetting:
mean(data[data$ID %in% id, pollutant], na.rm = TRUE)
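To show how that line could fit into the whole function, here is a minimal sketch. It assumes each monitor's file is named by its zero-padded three-digit ID (like 001.csv in the question); building the paths with sprintf() and file.path() and stacking the files with lapply() and do.call(rbind, ...) are my own choices, not something the assignment prescribes.

```r
pollutantmean <- function(directory, pollutant, id = 1:332) {
  # Assumption: each monitor's file is named by its zero-padded ID, e.g. "001.csv"
  files <- file.path(directory, sprintf("%03d.csv", id))

  # Read every requested file and stack the rows into one data frame
  data <- do.call(rbind, lapply(files, read.csv))

  # 'pollutant' is a string, so select the column by name with [ , ]
  # (data$pollutant would look for a column literally called "pollutant")
  mean(data[data$ID %in% id, pollutant], na.rm = TRUE)
}
```

Called as pollutantmean("specdata", "nitrate", 23), this should give the mean for monitor 23 only, provided the files follow that naming scheme.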
The following approach may also help:
ddf = structure(list(Date = structure(1:5, .Label = c("2003-01-01",
"2003-01-02", "2003-01-03", "2003-01-04", "2003-01-05"), class = "factor"),
sulfate = c(50L, 75L, 85L, 45L, 25L), nitrate = c(854L, 658L,
485L, 458L, 152L), ID = c(1L, 1L, 2L, 1L, 2L)), .Names = c("Date",
"sulfate", "nitrate", "ID"), class = "data.frame", row.names = c(NA,
-5L))
ddf
        Date sulfate nitrate ID
1 2003-01-01      50     854  1
2 2003-01-02      75     658  1
3 2003-01-03      85     485  2
4 2003-01-04      45     458  1
5 2003-01-05      25     152  2
library(reshape2)  # provides melt()
ddfm = melt(ddf[,2:4], id="ID")
ddfm
   ID variable value
1   1  sulfate    50
2   1  sulfate    75
3   2  sulfate    85
4   1  sulfate    45
5   2  sulfate    25
6   1  nitrate   854
7   1  nitrate   658
8   2  nitrate   485
9   1  nitrate   458
10  2  nitrate   152
with(ddfm, tapply(value, list(variable, ID), mean))
                1     2
sulfate  56.66667  55.0
nitrate 656.66667 318.5
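Since melt() comes from an add-on package, roughly the same per-monitor means can also be obtained with base R alone. This is only a sketch using aggregate() on the same toy data; the name res is illustrative. Note that the formula interface drops any row with an NA in either pollutant column by default, which is harmless here but would matter for the real data.

```r
# Same toy data as ddf above, rebuilt so the snippet runs on its own
ddf <- data.frame(
  Date    = c("2003-01-01", "2003-01-02", "2003-01-03", "2003-01-04", "2003-01-05"),
  sulfate = c(50L, 75L, 85L, 45L, 25L),
  nitrate = c(854L, 658L, 485L, 458L, 152L),
  ID      = c(1L, 1L, 2L, 1L, 2L)
)

# Mean of both pollutant columns for each monitor ID (one row per ID)
res <- aggregate(cbind(sulfate, nitrate) ~ ID, data = ddf, FUN = mean)
res
```

The result has one row per ID and one column per pollutant, which is the transpose of the tapply() layout.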