Using dplyr window functions to calculate percentiles
I have a working solution but am looking for a cleaner, more readable one that perhaps takes advantage of some of the newer dplyr window functions.
Using the mtcars dataset, if I want to look at the 25th, 50th, 75th percentiles and the mean and count of miles per gallon ("mpg") by the number of cylinders ("cyl"), I use the following code:
library(dplyr)
library(tidyr)
# load data
data("mtcars")
# Percentiles used in calculation
p <- c(.25,.5,.75)
# old dplyr solution
mtcars %>% group_by(cyl) %>%
  do(data.frame(p=p, stats=quantile(.$mpg, probs=p),
                n = length(.$mpg), avg = mean(.$mpg))) %>%
  spread(p, stats) %>%
  select(1, 4:6, 3, 2)
# note: the select and spread statements are just to get the data into
# the format in which I'd like to see it, but are not critical
Is there a way I can do this more cleanly with dplyr using some of the window functions (ntile, percent_rank, etc.)? By cleanly, I mean without the "do" statement.
Thank you
If you're up for using purrr::map, you can do it like this!
library(dplyr)
library(tidyr)
library(broom)
library(purrr)
mtcars %>%
  nest(-cyl) %>%
  mutate(Quantiles = map(data, ~ quantile(.$mpg))) %>%
  unnest(map(Quantiles, tidy))
#> # A tibble: 15 × 3
#>      cyl names     x
#>    <dbl> <chr> <dbl>
#>  1     6    0% 17.80
#>  2     6   25% 18.65
#>  3     6   50% 19.70
#>  4     6   75% 21.00
#>  5     6  100% 21.40
#>  6     4    0% 21.40
#>  7     4   25% 22.80
#>  8     4   50% 26.00
#>  9     4   75% 30.40
#> 10     4  100% 33.90
#> 11     8    0% 10.40
#> 12     8   25% 14.40
#> 13     8   50% 15.20
#> 14     8   75% 16.25
#> 15     8  100% 19.20
One nice thing about this approach is that the output is tidy, with one observation per row.
Here's a dplyr approach that avoids do but requires a separate call to quantile for each quantile value.
mtcars %>% group_by(cyl) %>%
  summarise(`25%`=quantile(mpg, probs=0.25),
            `50%`=quantile(mpg, probs=0.5),
            `75%`=quantile(mpg, probs=0.75),
            avg=mean(mpg),
            n=n())
  cyl   25%  50%   75%      avg  n
1   4 22.80 26.0 30.40 26.66364 11
2   6 18.65 19.7 21.00 19.74286  7
3   8 14.40 15.2 16.25 15.10000 14
It would be better if summarise could return multiple values with a single call to quantile, but this appears to be an open issue in dplyr development.
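For what it's worth, later dplyr releases relax this restriction. Below is a minimal sketch, assuming dplyr 1.1.0 or newer (where reframe() is available); the helper name quantile_df and the column names prob and value are made up for illustration and are not part of the original answers.

```r
library(dplyr)

p <- c(0.25, 0.5, 0.75)

# Hypothetical helper: a single quantile() call, returned as a data frame
# with one row per requested probability
quantile_df <- function(x, probs) {
  data.frame(prob = probs, value = unname(quantile(x, probs)))
}

# reframe() (assumption: dplyr >= 1.1.0) lets each group return
# several rows, so one quantile() call per group is enough
res <- mtcars %>%
  group_by(cyl) %>%
  reframe(quantile_df(mpg, p))

res
```

The result has one row per cylinder group and probability, which can then be spread to wide form if the layout from the question is preferred.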
UPDATE: Here's a variation on @JuliaSilge's answer that uses nesting to get the quantiles, but without the use of map. It does, however, require an extra line of code to add a column listing the quantile levels, as I'm not sure how (or if it's possible) to capture the names of the quantiles into a separate column directly from the call to quantile.
p = c(0.25,0.5,0.75)
mtcars %>%
  group_by(cyl) %>%
  summarise(quantiles = list(sprintf("%1.0f%%", p*100)),
            mpg = list(quantile(mpg, p))) %>%
  unnest
  cyl quantiles   mpg
1   4       25% 22.80
2   4       50% 26.00
3   4       75% 30.40
4   6       25% 18.65
5   6       50% 19.70
6   6       75% 21.00
7   8       25% 14.40
8   8       50% 15.20
9   8       75% 16.25
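On the question of capturing the quantile names: quantile() returns a named numeric vector, so the labels can be read off with names(). A small base R sketch for a single group follows; the object names q and labelled are only illustrative.

```r
p <- c(0.25, 0.5, 0.75)

# quantile() returns a named numeric vector; the names are the
# percent labels for the requested probabilities
q <- quantile(mtcars$mpg[mtcars$cyl == 4], probs = p)

# Put the labels and the values side by side as two columns
labelled <- data.frame(quantiles = names(q), mpg = unname(q))
labelled
```

This would make the sprintf() line unnecessary, at the cost of holding on to the result of quantile() before building the columns.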
This is a dplyr approach that uses the tidy() function of the broom package. Unfortunately it still requires do(), but it is a lot simpler.
library(dplyr)
library(broom)
mtcars %>%
  group_by(cyl) %>%
  do( tidy(t(quantile(.$mpg))) )
which gives:
    cyl   X0.  X25.  X50.  X75. X100.
  (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1     4  21.4 22.80  26.0 30.40  33.9
2     6  17.8 18.65  19.7 21.00  21.4
3     8  10.4 14.40  15.2 16.25  19.2
Note the use of t() since the broom package does not have a method for named numerics.
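To see what t() contributes here, this short base R sketch (object names q and m are illustrative) shows it turning the named vector from quantile() into a one-row matrix whose column names are the quantile labels, which is the shape that then becomes one row per group.

```r
# quantile() with default probs gives a named numeric vector of length 5
q <- quantile(mtcars$mpg)

# t() treats the vector as a column and transposes it, giving a
# 1 x 5 matrix; the vector names become the column names
m <- t(q)

dim(m)
colnames(m)
```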
This is based on my earlier answer for summary() here.