How do I replace NA values with zeros in an R dataframe?

2018-06-08 04:50:55

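I have a data.frame, and some of its columns contain NA values. I want to replace the NAs with zeros. How do I do this?

See my comment on @gsk3's answer. A simple example: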

> m <- matrix(sample(c(NA, 1:10), 100, replace = TRUE), 10)
> d <- as.data.frame(m)
> d
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1   4  3 NA  3  7  6  6 10  6   5
2   9  8  9  5 10 NA  2  1  7   2
3   1  1  6  3  6 NA  1  4  1   6
4  NA  4 NA  7 10  2 NA  4  1   8
5   1  2  4 NA  2  6  2  6  7   4
6  NA  3 NA NA 10  2  1 10  8   4
7   4  4  9 10  9  8  9  4 10  NA
8   5  8  3  2  1  4  5  9  4   7
9   3  9 10  1  9  9 10  5  3   3
10  4  2  2  5 NA  9  7  2  5   5

> d[is.na(d)] <- 0

> d
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1   4  3  0  3  7  6  6 10  6   5
2   9  8  9  5 10  0  2  1  7   2
3   1  1  6  3  6  0  1  4  1   6
4   0  4  0  7 10  2  0  4  1   8
5   1  2  4  0  2  6  2  6  7   4
6   0  3  0  0 10  2  1 10  8   4
7   4  4  9 10  9  8  9  4 10   0
8   5  8  3  2  1  4  5  9  4   7
9   3  9 10  1  9  9 10  5  3   3
10  4  2  2  5  0  9  7  2  5   5

There is no need to use apply. =)
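One caveat worth sketching: the one-liner above touches every column. If the data frame also holds character columns where an NA should stay an NA, the same idea can be restricted to the numeric columns. The data frame `d2` and the helper vector `num` below are made up for illustration and are not part of the original example.

```r
# Hypothetical mixed-type data frame (not from the original post)
d2 <- data.frame(a = c(1, NA, 3),
                 b = c("x", NA, "z"),
                 stringsAsFactors = FALSE)

# Logical vector flagging the numeric columns
num <- vapply(d2, is.numeric, logical(1))

# Replace NA with 0 in the numeric columns only
d2[num][is.na(d2[num])] <- 0
```

Here `d2$a` no longer contains NA, while the NA in the character column `d2$b` is left untouched.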

EDIT

You should also take a look at the norm package. It has a lot of nice features for missing data analysis. =)

For a single vector:

x <- c(1,2,NA,4,5)
x[is.na(x)] <- 0

For a data.frame, make a function out of the above, then apply it to the columns.
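A minimal sketch of that suggestion, assuming every column can sensibly hold a 0. The names `na_to_zero` and `df` are illustrative only.

```r
# Wrap the single-vector idiom in a function
na_to_zero <- function(x) {
  x[is.na(x)] <- 0
  x
}

# Small example data frame (made up for illustration)
df <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

# Apply the function to each column; assigning into df[] keeps
# the result a data.frame instead of a plain list
df[] <- lapply(df, na_to_zero)
```

Using `df[] <- lapply(...)` rather than `apply()` avoids converting the data frame to a matrix along the way.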

Please provide a reproducible example, as detailed here:

How to make a great R reproducible example?

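The hybrid dplyr / Base R option, mutate_all(funs(replace(., is.na(.), 0))), is about twice as fast as the Base R d[is.na(d)] <- 0 option. (See the benchmark analyses below.)

If you are working with very large data frames, data.table is the fastest option: 30% less time than dplyr, and 3 times faster than the Base R approaches. It also modifies the data in place, effectively allowing you to work with nearly twice as much data at once.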
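To illustrate what "in place" means here, a small sketch using data.table's set() in a loop, the same pattern as the data.table functions benchmarked in this answer. The table `dt` is made up for illustration, and address() is used only to show that the object is not copied.

```r
library(data.table)

# Small example table (made up for illustration)
dt <- data.table(a = c(1, NA, 3), b = c(NA, 5, 6))

# Remember where the table lives in memory
addr_before <- address(dt)

# set() overwrites the NA cells by reference, one column at a time
for (j in seq_along(dt)) {
  set(dt, i = which(is.na(dt[[j]])), j = j, value = 0)
}

# Same memory address afterwards: the table was modified, not copied
same_object <- identical(address(dt), addr_before)
```

Because no copy is made, `dt` itself is changed; call copy() first if the original values are still needed.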

A collection of other helpful tidyverse replacement approaches

By location:

index: mutate_at(c(5:10), funs(replace(., is.na(.), 0)))

direct reference: mutate_at(vars(var5:var10), funs(replace(., is.na(.), 0)))

fixed match: mutate_at(vars(contains("1")), funs(replace(., is.na(.), 0)))

or in place of contains(), try ends_with() or starts_with()

pattern match: mutate_at(vars(matches("\\d{2}")), funs(replace(., is.na(.), 0)))

Conditionally:
(change just the numeric columns and leave the string columns alone)

integers: mutate_if(is.integer, funs(replace(., is.na(.), 0)))

numbers (integers and doubles): mutate_if(is.numeric, funs(replace(., is.na(.), 0)))

strings: mutate_if(is.character, funs(replace(., is.na(.), 0)))
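A note for readers on newer dplyr versions: funs() has been deprecated since dplyr 0.8.0, and from dplyr 1.0.0 onward the mutate_all/_at/_if variants are superseded by across(). The sketch below is a rough equivalent of the conditional (numeric-only) variant. It assumes dplyr >= 1.0.0, it is not one of the approaches benchmarked in this answer, and the data frame `df` is made up for illustration.

```r
library(dplyr)

# Small mixed-type example (made up for illustration)
df <- data.frame(id = c("a", NA, "c"),
                 n  = c(1L, NA, 3L),
                 x  = c(NA, 2.5, 3),
                 stringsAsFactors = FALSE)

# Replace NA with 0 in every numeric column; other columns are untouched
out <- df %>%
  mutate(across(where(is.numeric), ~ replace(.x, is.na(.x), 0)))
```

The selection helpers shown above (starts_with(), contains(), matches(), column ranges) can be used inside across() in place of where(is.numeric).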

The Complete Analysis -

Approaches tested:

# Base R: 
baseR.sbst.rssgn   <- function(x) { x[is.na(x)] <- 0; x }
baseR.replace      <- function(x) { replace(x, is.na(x), 0) }
baseR.for          <- function(x) { for(j in seq_len(ncol(x)))
                                    x[[j]][is.na(x[[j]])] <- 0
                                    x }  # return the modified data frame
# tidyverse
## dplyr
library(tidyverse)
dplyr_if_else      <- function(x) { mutate_all(x, funs(if_else(is.na(.), 0, .))) }
dplyr_coalesce     <- function(x) { mutate_all(x, funs(coalesce(., 0))) }

## tidyr
tidyr_replace_na   <- function(x) { replace_na(x, as.list(setNames(rep(0, 10), as.list(c(paste0("var", 1:10)))))) }

## hybrid 
hybrid.ifelse    <- function(x) { mutate_all(x, funs(ifelse(is.na(.), 0, .))) }
hybrd.rplc_all   <- function(x) { mutate_all(x, funs(replace(., is.na(.), 0))) }
hybrd.rplc_at.idx<- function(x) { mutate_at(x, c(1:10), funs(replace(., is.na(.), 0))) }
hybrd.rplc_at.nse<- function(x) { mutate_at(x, vars(var1:var10), funs(replace(., is.na(.), 0))) }
hybrd.rplc_at.stw<- function(x) { mutate_at(x, vars(starts_with("var")), funs(replace(., is.na(.), 0))) }
hybrd.rplc_at.ctn<- function(x) { mutate_at(x, vars(contains("var")), funs(replace(., is.na(.), 0))) }
hybrd.rplc_at.mtc<- function(x) { mutate_at(x, vars(matches("\\d+")), funs(replace(., is.na(.), 0))) }
hybrd.rplc_if    <- function(x) { mutate_if(x, is.numeric, funs(replace(., is.na(.), 0))) }

# data.table   
library(data.table)
DT.for.set.nms   <- function(x) { for (j in names(x))
                                    set(x,which(is.na(x[[j]])),j,0) }
DT.for.set.sqln  <- function(x) { for (j in seq_len(ncol(x)))
                                    set(x,which(is.na(x[[j]])),j,0) }

The code for this analysis:

library(microbenchmark)
# 20% NA filled dataframe of 5 Million rows and 10 columns
set.seed(42) # to recreate the exact dataframe
dfN <- as.data.frame(matrix(sample(c(NA, as.numeric(1:4)), 5e6*10, replace = TRUE),
                            dimnames = list(NULL, paste0("var", 1:10)), 
                            ncol = 10))
# Running 250 trials with each replacement method 
# (the functions are executed locally - so that the original dataframe remains unmodified in all cases)
perf_results <- microbenchmark(
    hybrid.ifelse    = hybrid.ifelse(copy(dfN)),
    dplyr_if_else    = dplyr_if_else(copy(dfN)),
    baseR.sbst.rssgn = baseR.sbst.rssgn(copy(dfN)),
    baseR.replace    = baseR.replace(copy(dfN)),
    dplyr_coalesce   = dplyr_coalesce(copy(dfN)),
    hybrd.rplc_at.nse= hybrd.rplc_at.nse(copy(dfN)),
    hybrd.rplc_at.stw= hybrd.rplc_at.stw(copy(dfN)),
    hybrd.rplc_at.ctn= hybrd.rplc_at.ctn(copy(dfN)),
    hybrd.rplc_at.mtc= hybrd.rplc_at.mtc(copy(dfN)),
    hybrd.rplc_at.idx= hybrd.rplc_at.idx(copy(dfN)),
    hybrd.rplc_if    = hybrd.rplc_if(copy(dfN)),
    tidyr_replace_na = tidyr_replace_na(copy(dfN)),
    baseR.for        = baseR.for(copy(dfN)),
    DT.for.set.nms   = DT.for.set.nms(copy(dfN)),
    DT.for.set.sqln  = DT.for.set.sqln(copy(dfN)),
    times = 250L
)

Summary of Results

> perf_results
Unit: milliseconds
              expr       min        lq      mean    median        uq      max neval
     hybrid.ifelse 5250.5259 5620.8650 5809.1808 5759.3997 5947.7942 6732.791   250
     dplyr_if_else 3209.7406 3518.0314 3653.0317 3620.2955 3746.0293 4390.888   250
  baseR.sbst.rssgn 1611.9227 1878.7401 1964.6385 1942.8873 2031.5681 2485.843   250
     baseR.replace 1559.1494 1874.7377 1946.2971 1920.8077 2002.4825 2516.525   250
    dplyr_coalesce  949.7511 1231.5150 1279.3015 1288.3425 1345.8662 1624.186   250
 hybrd.rplc_at.nse  735.9949  871.1693 1016.5910 1064.5761 1104.9590 1361.868   250
 hybrd.rplc_at.stw  704.4045  887.4796 1017.9110 1063.8001 1106.7748 1338.557   250
 hybrd.rplc_at.ctn  723.9838  878.6088 1017.9983 1063.0406 1110.0857 1296.024   250
 hybrd.rplc_at.mtc  686.2045  885.8028 1013.8293 1061.2727 1105.7117 1269.949   250
 hybrd.rplc_at.idx  696.3159  880.7800 1003.6186 1038.8271 1083.1932 1309.635   250
     hybrd.rplc_if  705.9907  889.7381 1000.0113 1036.3963 1083.3728 1338.190   250
  tidyr_replace_na  680.4478  973.1395  978.2678 1003.9797 1051.2624 1294.376   250
         baseR.for  670.7897  965.6312  983.5775 1001.5229 1052.5946 1206.023   250
    DT.for.set.nms  496.8031  569.7471  695.4339  623.1086  861.1918 1067.640   250
   DT.for.set.sqln  500.9945  567.2522  671.4158  623.1454  764.9744 1033.463   250

Boxplot of Results (on a log scale)

# adjust the margins to prepare for better boxplot printing
par(mar=c(8,5,1,1) + 0.1) 
# generate boxplot
boxplot(perf_results, las = 2, xlab = "", ylab = "log(time)[milliseconds]")

Color-coded Scatterplot of Trials (on a log scale)

qplot(y=time/10^9, data=perf_results, colour=expr) + 
    labs(y = "log10 Scaled Elapsed Time per Trial (secs)", x = "Trial Number") +
    scale_y_log10(breaks=c(1, 2, 4))

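Scatterplot of All Trial Times

A Note on the Other High Performers

When the datasets get larger, tidyr's replace_na had historically pulled out in front. With the current collection of 50M data points to run through, it performs almost exactly as well as a Base R for loop. I am curious to see what happens for different sized dataframes.

Additional examples for the mutate and summarize _at and _all function variants can be found here: https: _all Additionally, I found helpful demonstrations and collections of examples here: https://blog.exploratory.io/dplyr-0-5-is-awesome-heres-why-be095fd4eb8a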

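Attributions and Appreciations

With special thanks to:

Tyler Rinker and Akrun for demonstrating microbenchmark.

alexis_laz for helping me understand the use of local(), and (with Frank's patient help, too) the role that silent coercion plays in speeding up many of these approaches.

ArthurYip for the poke to add the newer coalesce() function and update the analysis.

Gregor for the nudge to figure out the data.table functions well enough to finally include them in the lineup.

Base R for loop: alexis_laz

data.table for loops: Matt_Dowle

(Of course, please give them upvotes too if you find those approaches useful.)

Note on my use of numerics: If you do have a pure integer dataset, all of these functions will run faster. Please see alexis_laz's work for more information. IRL, I can't recall encountering a data set containing more than 10-15% integers, so I am running these tests on fully numeric dataframes.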
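A small Base R illustration of the coercion involved (the vector names are made up): filling an integer vector with the double 0 converts the whole vector to double, while filling with the integer literal 0L keeps it an integer.

```r
# Integer vector with a missing value
x_int <- c(1L, NA, 3L)

# Filling with 0 (a double) coerces the result to double
filled_with_double <- replace(x_int, is.na(x_int), 0)

# Filling with 0L (an integer) keeps the integer type
filled_with_integer <- replace(x_int, is.na(x_int), 0L)

typeof(filled_with_double)   # "double"
typeof(filled_with_integer)  # "integer"
```

If the column type matters downstream, choosing the fill value to match the column (0L for integer columns) avoids this conversion.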
