Replacing NAs with latest non

2018-06-08 04:04:24

In a data.frame (or data.table), I would like to "fill forward" NAs with the closest previous non-NA value. A simple example, using vectors (instead of a data.frame ) is the following:

> y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)

I would like a function fill.NAs() that allows me to construct yy such that:

> yy
[1] NA NA NA  2  2  2  2  3  3  3  4  4

I need to repeat this operation for many (total ~1 Tb) small sized data.frame s (~30-50 Mb), where a row is NA is all its entries are. What is a good way to approach the problem?

The ugly solution I cooked up uses this function:

last <- function (x){
    x[length(x)]
}    

fill.NAs <- function(isNA){
if (isNA[1] == 1) {
    isNA[1:max({which(isNA==0)[1]-1},1)] <- 0 # first is NAs 
                                              # can't be forward filled
}
isNA.neg <- isNA.pos <- isNA.diff <- diff(isNA)
isNA.pos[isNA.diff < 0] <- 0
isNA.neg[isNA.diff > 0] <- 0
which.isNA.neg <- which(as.logical(isNA.neg))
if (length(which.isNA.neg)==0) return(NULL) # generates warnings later, but works
which.isNA.pos <- which(as.logical(isNA.pos))
which.isNA <- which(as.logical(isNA))
if (length(which.isNA.neg)==length(which.isNA.pos)){
    replacement <- rep(which.isNA.pos[2:length(which.isNA.neg)], 
                                which.isNA.neg[2:max(length(which.isNA.neg)-1,2)] - 
                                which.isNA.pos[1:max(length(which.isNA.neg)-1,1)])      
    replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
} else {
    replacement <- rep(which.isNA.pos[1:length(which.isNA.neg)], which.isNA.neg - which.isNA.pos[1:length(which.isNA.neg)])     
    replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
}
replacement
}

The function fill.NAs is used as follows:

y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
isNA <- as.numeric(is.na(y))
replacement <- fill.NAs(isNA)
if (length(replacement)){
which.isNA <- which(as.logical(isNA))
to.replace <- which.isNA[which(isNA==0)[1]:length(which.isNA)]
y[to.replace] <- y[replacement]
}

Output

> y
[1] NA  2  2  2  2  3  3  3  4  4  4

... which seems to work. But, man, is it ugly! Any suggestions?

You probably want to use the na.locf() function from the zoo package to carry the last observation forward to replace your NA values.

Here is the beginning of its usage example from the help page:

> example(na.locf)

na.lcf> az <- zoo(1:6)

na.lcf> bz <- zoo(c(2,NA,1,4,5,2))

na.lcf> na.locf(bz)
1 2 3 4 5 6 
2 2 1 4 5 2 

na.lcf> na.locf(bz, fromLast = TRUE)
1 2 3 4 5 6 
2 1 1 4 5 2 

na.lcf> cz <- zoo(c(NA,9,3,2,3,2))

na.lcf> na.locf(cz)
2 3 4 5 6 
9 3 2 3 2

Sorry for digging up an old question. I couldn't look up the function to do this job on the train, so I wrote one myself.

I was proud to find out that it's a tiny bit faster.
It's less flexible though.

But it plays nice with ave , which is what I needed.

repeat.before = function(x) {   # repeats the last non NA value. Keeps leading NA
    ind = which(!is.na(x))      # get positions of nonmissing values
    if(is.na(x[1]))             # if it begins with a missing, add the 
          ind = c(1,ind)        # first position to the indices
    rep(x[ind], times = diff(   # repeat the values at these indices
       c(ind, length(x) + 1) )) # diffing the indices + length yields how often 
}                               # they need to be repeated

x = c(NA,NA,'a',NA,NA,NA,NA,NA,NA,NA,NA,'b','c','d',NA,NA,NA,NA,NA,'e')  
xx = rep(x, 1000000)  
system.time({ yzoo = na.locf(xx,na.rm=F)})  
## user  system elapsed   
## 2.754   0.667   3.406   
system.time({ yrep = repeat.before(xx)})  
## user  system elapsed   
## 0.597   0.199   0.793

Edit

As this became my most upvoted answer, I was reminded often that I don't use my own function, because I often need zoo's maxgap argument. Because zoo has some weird problems in edge cases when I use dplyr + dates that I couldn't debug, I came back to this today to improve my old function.

I benchmarked my improved function and all the other entries here. For the basic set of features, tidyr::fill is fastest while also not failing the edge cases. The Rcpp entry by @BrandonBertelsen is faster still, but it's inflexible regarding the input's type (he tested edge cases incorrectly due to a misunderstanding of all.equal ).

If you need maxgap , my function below is faster than zoo (and doesn't have the weird problems with dates).

I put up the documentation of my tests.

new function

repeat_last = function(x, forward = TRUE, maxgap = Inf, na.rm = FALSE) {
    if (!forward) x = rev(x)           # reverse x twice if carrying backward
    ind = which(!is.na(x))             # get positions of nonmissing values
    if (is.na(x[1]) && !na.rm)         # if it begins with NA
        ind = c(1,ind)                 # add first pos
    rep_times = diff(                  # diffing the indices + length yields how often
        c(ind, length(x) + 1) )          # they need to be repeated
    if (maxgap < Inf) {
        exceed = rep_times - 1 > maxgap  # exceeding maxgap
        if (any(exceed)) {               # any exceed?
            ind = sort(c(ind[exceed] + 1, ind))      # add NA in gaps
            rep_times = diff(c(ind, length(x) + 1) ) # diff again
        }
    }
    x = rep(x[ind], times = rep_times) # repeat the values at these indices
    if (!forward) x = rev(x)           # second reversion
    x
}

I've also put the function in my formr package (Github only).

处理大数据量，为了提高效率，我们可以使用data.table包。

require(data.table)
replaceNaWithLatest <- function(
  dfIn,
  nameColNa = names(dfIn)[1]
){
  dtTest <- data.table(dfIn)
  setnames(dtTest, nameColNa, "colNa")
  dtTest[, segment := cumsum(!is.na(colNa))]
  dtTest[, colNa := colNa[1], by = "segment"]
  dtTest[, segment := NULL]
  setnames(dtTest, "colNa", nameColNa)
  return(dtTest)
}

链接地址: http://www.djcxy.com/p/24794.html

上一篇: 在de中引发异常

下一篇: 用最新的非NAs代替NAs