Replacing NAs with the latest non-NA value

2018-06-08 04:04:18

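In a data.frame (or data.table), I would like to "fill forward" NAs with the closest previous non-NA value. A simple example, using a vector (instead of a data.frame), is the following: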

> y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

I would like a function fill.NAs() that allows me to construct yy such that:

> yy
[1] NA  2  2  2  2  3  3  4  4  4

I need to repeat this operation for many small data.frames, where a row is NA only if all of its entries are NA. What is a good way to approach the problem?

The ugly solution I cooked up uses this function:

last <- function (x){
    x[length(x)]
}    

fill.NAs <- function(isNA){
if (isNA[1] == 1) {
    isNA[1:max({which(isNA==0)[1]-1},1)] <- 0 # first is NAs 
                                              # can't be forward filled
}
isNA.neg <- isNA.pos <- isNA.diff <- diff(isNA)
isNA.pos[isNA.diff < 0] <- 0
isNA.neg[isNA.diff > 0] <- 0
which.isNA.neg <- which(as.logical(isNA.neg))
if (length(which.isNA.neg)==0) return(NULL) # generates warnings later, but works
which.isNA.pos <- which(as.logical(isNA.pos))
which.isNA <- which(as.logical(isNA))
if (length(which.isNA.neg)==length(which.isNA.pos)){
    replacement <- rep(which.isNA.pos[2:length(which.isNA.neg)], 
                                which.isNA.neg[2:max(length(which.isNA.neg)-1,2)] - 
                                which.isNA.pos[1:max(length(which.isNA.neg)-1,1)])      
    replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
} else {
    replacement <- rep(which.isNA.pos[1:length(which.isNA.neg)], which.isNA.neg - which.isNA.pos[1:length(which.isNA.neg)])     
    replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
}
replacement
}

The function fill.NAs is used as follows:

y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
isNA <- as.numeric(is.na(y))
replacement <- fill.NAs(isNA)
if (length(replacement)){
which.isNA <- which(as.logical(isNA))
to.replace <- which.isNA[which(isNA==0)[1]:length(which.isNA)]
y[to.replace] <- y[replacement]
}

Output

> y
[1] NA  2  2  2  2  3  3  4  4  4

... which seems to work. But, man, is it ugly! Any suggestions?

You probably want to use the na.locf() function from the zoo package, which carries the last observation forward to replace your NA values.

Here is the beginning of the usage example from its help page:

> example(na.locf)

na.lcf> az <- zoo(1:6)

na.lcf> bz <- zoo(c(2,NA,1,4,5,2))

na.lcf> na.locf(bz)
1 2 3 4 5 6 
2 2 1 4 5 2 

na.lcf> na.locf(bz, fromLast = TRUE)
1 2 3 4 5 6 
2 1 1 4 5 2 

na.lcf> cz <- zoo(c(NA,9,3,2,3,2))

na.lcf> na.locf(cz)
2 3 4 5 6 
9 3 2 3 2
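
Note that the last call in the excerpt drops the leading NA, because na.locf() removes leading NAs by default. The following is a minimal sketch, assuming the zoo package is installed, of how the call might look for the vector from the question; passing na.rm = FALSE keeps the leading NA so the result has the same length as the input. The data frame `df` is a made-up illustration and is not taken from the question.

```r
library(zoo)

y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

# na.rm = FALSE keeps the leading NA, so the output is as long as y
filled <- na.locf(y, na.rm = FALSE)
print(filled)
# [1] NA  2  2  2  2  3  3  4  4  4

# Hypothetical data frame: fill each column separately
df <- data.frame(a = c(NA, 1, NA, 3), b = c(5, NA, NA, 8))
df[] <- lapply(df, na.locf, na.rm = FALSE)
print(df)
```

Using `df[] <- lapply(...)` keeps the object a data.frame while replacing every column in turn.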

Sorry for digging up an old question. I couldn't look up the function that does this job while on the train, so I wrote one myself.

I was proud to find out that it's a tiny bit faster.
It's less flexible, though.

But it plays nicely with ave, which is what I needed.

repeat.before = function(x) {   # repeats the last non NA value. Keeps leading NA
    ind = which(!is.na(x))      # get positions of nonmissing values
    if(is.na(x[1]))             # if it begins with a missing, add the 
          ind = c(1,ind)        # first position to the indices
    rep(x[ind], times = diff(   # repeat the values at these indices
       c(ind, length(x) + 1) )) # diffing the indices + length yields how often 
}                               # they need to be repeated

x = c(NA,NA,'a',NA,NA,NA,NA,NA,NA,NA,NA,'b','c','d',NA,NA,NA,NA,NA,'e')  
xx = rep(x, 1000000)  
system.time({ yzoo = na.locf(xx,na.rm=F)})  
## user  system elapsed   
## 2.754   0.667   3.406   
system.time({ yrep = repeat.before(xx)})  
## user  system elapsed   
## 0.597   0.199   0.793
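
To illustrate the remark about ave, here is a small sketch of a grouped forward fill. It repeats the repeat.before definition so that it runs on its own; the data frame `df` and its column names are invented for the example.

```r
# Same function as above, repeated so this snippet is self-contained
repeat.before <- function(x) {
    ind <- which(!is.na(x))            # positions of non-missing values
    if (is.na(x[1])) ind <- c(1, ind)  # keep a leading NA as its own run
    rep(x[ind], times = diff(c(ind, length(x) + 1)))
}

# Hypothetical data: fill within each id, never across ids
df <- data.frame(id    = c("a", "a", "a", "b", "b", "b"),
                 value = c(1, NA, NA, NA, 5, NA))
df$filled <- ave(df$value, df$id, FUN = repeat.before)
print(df$filled)
# [1]  1  1  1 NA  5  5
```

The NA at the start of group "b" stays NA because ave applies the function to each group separately, so nothing is carried over from group "a".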

Edit

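As this became my most upvoted answer, I was often reminded that I don't actually use my own function, because I often need zoo's maxgap argument. Because zoo has some weird problems in edge cases when I use dplyr + dates, which I couldn't debug, I came back to this today to improve my old function.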

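I benchmarked my improved function and all the other entries here. For the basic set of features, tidyr::fill is the fastest while also not failing the edge cases. The Rcpp entry by @BrandonBertelsen is faster still, but it is inflexible regarding the type of the input (he tested the edge cases incorrectly due to a misunderstanding of all.equal).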
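
For readers who have not used tidyr::fill, a minimal sketch of the call, assuming the tidyr package is installed. The data frame and its column names are made up for illustration; the y column reuses the vector from the question.

```r
library(tidyr)

df <- data.frame(id = 1:10,
                 y  = c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA))

# .direction = "down" (the default) carries the last non-NA value forward
filled <- fill(df, y, .direction = "down")
print(filled$y)
# [1] NA  2  2  2  2  3  3  4  4  4
```

The leading NA is left in place because there is no earlier value to carry forward.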

If you need maxgap, my function below is faster than zoo (and doesn't have the weird problems with dates).

I have put up the documentation of my tests.

New function

repeat_last = function(x, forward = TRUE, maxgap = Inf, na.rm = FALSE) {
    if (!forward) x = rev(x)           # reverse x twice if carrying backward
    ind = which(!is.na(x))             # get positions of nonmissing values
    if (is.na(x[1]) && !na.rm)         # if it begins with NA
        ind = c(1,ind)                 # add first pos
    rep_times = diff(                  # diffing the indices + length yields how often
        c(ind, length(x) + 1) )          # they need to be repeated
    if (maxgap < Inf) {
        exceed = rep_times - 1 > maxgap  # exceeding maxgap
        if (any(exceed)) {               # any exceed?
            ind = sort(c(ind[exceed] + 1, ind))      # add NA in gaps
            rep_times = diff(c(ind, length(x) + 1) ) # diff again
        }
    }
    x = rep(x[ind], times = rep_times) # repeat the values at these indices
    if (!forward) x = rev(x)           # second reversion
    x
}
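
A short usage sketch of how the arguments behave. The function is repeated in condensed form so the snippet runs on its own, and the input vector is invented for the example.

```r
# Condensed copy of repeat_last from above
repeat_last <- function(x, forward = TRUE, maxgap = Inf, na.rm = FALSE) {
    if (!forward) x <- rev(x)
    ind <- which(!is.na(x))
    if (is.na(x[1]) && !na.rm) ind <- c(1, ind)
    rep_times <- diff(c(ind, length(x) + 1))
    if (maxgap < Inf) {
        exceed <- rep_times - 1 > maxgap
        if (any(exceed)) {
            ind <- sort(c(ind[exceed] + 1, ind))
            rep_times <- diff(c(ind, length(x) + 1))
        }
    }
    x <- rep(x[ind], times = rep_times)
    if (!forward) x <- rev(x)
    x
}

x <- c(1, NA, NA, NA, 2, NA)

fwd <- repeat_last(x)                   # plain forward fill
# [1] 1 1 1 1 2 2
gap <- repeat_last(x, maxgap = 2)       # the run of 3 NAs exceeds maxgap
# [1]  1 NA NA NA  2  2
bwd <- repeat_last(x, forward = FALSE)  # carry values backward instead
# [1]  1  2  2  2  2 NA
```

With maxgap = 2, a run of NAs longer than two is left untouched as a whole, while the single trailing NA is still filled.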

I have also put the function in my formr package (GitHub only).

When dealing with large amounts of data, we can use the data.table package to be more efficient.

require(data.table)
replaceNaWithLatest <- function(
  dfIn,
  nameColNa = names(dfIn)[1]
){
  dtTest <- data.table(dfIn)
  setnames(dtTest, nameColNa, "colNa")
  dtTest[, segment := cumsum(!is.na(colNa))]
  dtTest[, colNa := colNa[1], by = "segment"]
  dtTest[, segment := NULL]
  setnames(dtTest, "colNa", nameColNa)
  return(dtTest)
}
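
The function above relies on a grouping trick: cumsum(!is.na(x)) increases by one at every non-NA value, so each non-NA value and the NAs that follow it share one segment id. A minimal sketch of the intermediate steps, assuming data.table is installed; the column names `y`, `segment` and `filled` are chosen only for this illustration.

```r
library(data.table)

dt <- data.table(y = c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA))

# Each non-NA value starts a new segment; leading NAs land in segment 0
dt[, segment := cumsum(!is.na(y))]
print(dt$segment)
# [1] 0 1 2 2 2 3 3 4 4 4

# Within a segment, the first element is the value to carry forward
dt[, filled := y[1], by = segment]
print(dt$filled)
# [1] NA  2  2  2  2  3  3  4  4  4
```

Segment 0 contains only the leading NA, so its first element is NA and the leading gap stays unfilled.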
