creating NA?
I'm trying to calculate the mean number of unique fruits per person (my usual practice data). This works perfectly well with both these lines of code:
with(df, tapply(fruit, names, FUN = function(x) length(unique(x))))->uniques
sum(uniques)/length(unique(df$names))
aggregate(df[,"fruit"], by=list(id=names), FUN = function(x) length(unique(x)))->d1
sum(d1$x)/length(unique(df$names))
My problem is that when I use the code on my real data it doesn't work. My real data is prescribing data, where I want the mean number of unique drugs per person. The tapply code appears to have created brand new patient ids that do not exist in the original df. It has also returned thousands of NA values. There are no missing values in my id column and none in the drug_code column either.
with(dt3, tapply(drug_code, id, FUN = function(x) length(unique(x))))->uniques
head(uniques)
uniques
Patient HAI0000001 NA
Patient HAI0000003 NA
Patient HAI0000008 NA
Patient HAI0000010 NA
Patient HAI0000014 NA
Patient HAI0000020 NA
table(dt3$id=="Patient HAI0000001") ## checking whether HAI0000001 occurs in the original df; the df has 228954 rows and 5 cols
FALSE
228954
For the aggregate code I get an error:
aggregate(dt3[,"drug_code"], by=list(id=id), FUN = function(x) length(unique(x)))->d1
Error in aggregate.data.frame(as.data.frame(x), ...) :
arguments must have same length
I don't understand what's happening. My real data is similar to my practice data in that it has an id column and a drug/fruit column. There are no missing data in either df. I know lapply is better for data frames, but I don't necessarily need a df back. And in any case the tapply code works on the practice data, which is a df. Does anyone have any idea what is happening here?
Practice DF:
names<-as.character(c("john", "john", "john", "john", "john", "mary", "mary","mary","mary","mary", "jim", "sylvia","ted","ted","mary", "sylvia", "jim", "ted", "john", "ted"))
dates<-as.Date(c("2010-07-01", "2010-09-01", "2010-11-01", "2010-12-01", "2011-01-01", "2010-08-12", "2010-11-11", "2010-05-12", "2010-12-03", "2010-07-12", "2010-12-21", "2010-02-18", "2010-10-29", "2010-08-13", "2010-11-11", "2010-05-12", "2010-04-01", "2010-05-06", "2010-09-28", "2010-11-28" ))
fruit<-as.character(c("kiwi","apple","banana","orange","apple","orange","apple","orange", "apple", "apple", "pineapple", "peach", "nectarine", "grape", "melon", "apricot", "plum", "lychee", "watermelon", "apple" ))
df<-data.frame(names,dates,fruit)
example of real data:
head(dt3)
                  id quantity date_of_claim drug_code      index
1 Patient HAI0000560        1    2009-10-15   R03AC02 2010-04-06
2 Patient HAI0000560        1    2009-10-15   R03AK06 2010-04-06
3 Patient HAI0000560       30    2009-10-15   R03BB04 2010-04-06
4 Patient HAI0000560       30    2009-10-15   A02BC01 2010-04-06
5 Patient HAI0000560       50    2009-10-15   M02AA15 2010-04-06
6 Patient HAI0000560       30    2009-10-15   N02BE51 2010-04-06
In your case you are asking for a single number: the mean of all the individual lengths of a particular vector (unique(fruit)) within each patient id. This shows you first the individual unique counts and then the result of mean:
> with(df, tapply(fruit, names, function(x) length(unique(x)) ))
jim john mary sylvia ted
2 5 3 2 4
> mean ( with(df, tapply(fruit, names, function(x) length(unique(x)) )) )
[1] 3.2
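The aggregate call from the question can be made to give the same figure. A minimal sketch, assuming the "arguments must have same length" error arose because `by=list(id=id)` picked up some other object called `id` from the workspace instead of the column in dt3 (the question does not confirm this): take the grouping vector from the data frame explicitly. The practice data is rebuilt here with only the two columns that matter.

```r
# Practice data from the question (names and fruit columns only)
df <- data.frame(
  names = c("john", "john", "john", "john", "john", "mary", "mary", "mary",
            "mary", "mary", "jim", "sylvia", "ted", "ted", "mary", "sylvia",
            "jim", "ted", "john", "ted"),
  fruit = c("kiwi", "apple", "banana", "orange", "apple", "orange", "apple",
            "orange", "apple", "apple", "pineapple", "peach", "nectarine",
            "grape", "melon", "apricot", "plum", "lychee", "watermelon",
            "apple"),
  stringsAsFactors = FALSE
)

# Group by df$names explicitly so no global object can be picked up by mistake
d1 <- aggregate(df["fruit"], by = list(id = df$names),
                FUN = function(x) length(unique(x)))

# Mean number of unique fruits per person
mean(d1$fruit)
```

For the real data the equivalent would be `by = list(id = dt3$id)`.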
I would comment that your test for containment of a particular value in your code above had a trailing space, which might have caused problems: "string " will not equal "string". I have put a copy of the trim function from the gdata package in my .Rprofile file to make it easier for me to handle this possibility.
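As an illustration of the whitespace point, here is a small sketch using base R's `trimws` (available since R 3.2.0) rather than the gdata function. The id values are made up for the example and are not taken from the real data.

```r
# Hypothetical ids: the first has a trailing space, the third a leading space
ids <- c("Patient HAI0000560 ", "Patient HAI0000560", " Patient HAI0000001")

# Exact comparison fails when stray whitespace is present
ids == "Patient HAI0000560"

# Strip leading and trailing whitespace before comparing
clean <- trimws(ids)
clean == "Patient HAI0000560"
```

Running `trimws` on the id column before grouping would rule this possibility out.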
I might be missing something, but wouldn't a simple tapply work here? The line below calculates the number of different fruits per person:
x=tapply(df$fruit,df$names,function(x){length(unique(x))})
And then mean(x) would give you the average across people?
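One possible explanation for the NA values and the "new" patient ids on the real data, offered as a guess since the question does not show `str(dt3)`: if dt3 is a subset of a larger data frame and id is a factor, the factor keeps the levels of patients that were filtered out, and tapply returns NA for every such empty level. The toy data below is hypothetical and only demonstrates the mechanism.

```r
# Hypothetical data: id is a factor with three levels
full <- data.frame(
  id = factor(c("P1", "P1", "P2", "P3")),
  drug_code = c("A", "B", "A", "C"),
  stringsAsFactors = FALSE
)

# Subsetting removes all rows for P2, but the level "P2" remains
sub <- full[full$id != "P2", ]

# tapply reports the empty level P2 as NA
counts <- tapply(sub$drug_code, sub$id, function(x) length(unique(x)))
counts

# Dropping unused levels removes the phantom id
sub$id <- droplevels(sub$id)
counts_fixed <- tapply(sub$drug_code, sub$id, function(x) length(unique(x)))
counts_fixed
mean(counts_fixed)
```

If this is the cause, `dt3$id <- droplevels(dt3$id)` (or converting the column with `as.character`) before the tapply call should remove the NA entries.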