Efficiently construct GRanges/IRanges from Rle vector

I have a Run length encoded vector representing some value at every position on the genome, in order. As a toy example suppose I had just one chromosome of length 10, then I would have a vector looking like

library(GenomicRanges)

set.seed(1)
toyData = Rle(sample(1:3,10,replace=TRUE))

I would like to coerce this into a GRanges object. The best I can come up with is

gr = GRanges('toyChr',IRanges(cumsum(c(0,runLength(toyData)[-nrun(toyData)])),
                              width=runLength(toyData)),
             toyData = runValue(toyData))

which works, but is quite slow. Is there a faster way to construct the same object?


As @TheUnfunCat pointed out, the OP's solution is pretty solid. The solution below is only moderately faster than the original solution. I tried almost every combination of base R and couldn't beat the efficiency Rle from the S4Vectors package, thus I resorted to Rcpp . Here is the main function :

GenomeRcpp <- function(v) {
    x <- WhichDiffZero(v)
    m <- v[c(1L,x+1L)]
    s <- c(0L,x)
    e <- c(x,length(v))-1L
    GRanges('toyChr',IRanges(start = s, end = e), toyData = m)
}

The WhichDiffZero is the Rcpp function that pretty much does the exact same thing as which(diff(v) != 0) in base R . Much credit goes to @G.Grothendieck.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector WhichDiffZero(IntegerVector x) {
    int nx = x.size()-1;
    std::vector<int> y;
    y.reserve(nx);
    for(int i = 0; i < nx; i++) {
        if (x[i] != x[i+1]) y.push_back(i+1);
    }
    return wrap(y);
}

Below are some benchmarks:

set.seed(437)
testData <- do.call(c,lapply(1:10^5, function(x) rep(sample(1:50, 1), sample(1:30, 1))))

microbenchmark(GenomeRcpp(testData), GenomeOrig(testData))
Unit: milliseconds
                expr      min       lq     mean   median       uq      max neval cld
GenomeRcpp(testData) 20.30118 22.45121 26.59644 24.62041 27.28459 198.9773   100   a
GenomeOrig(testData) 25.11047 27.12811 31.73180 28.96914 32.16538 225.1727   100   a

identical(GenomeRcpp(testData), GenomeOrig(testData))
[1] TRUE

I've been working on this off and on for the past few days and I'm definitely not satisfied. I'm hoping that someone will take what I've done (as it is a different approach) and create something much better.

链接地址: http://www.djcxy.com/p/36042.html

上一篇: 代码分割/预加载内容,而用户正在浏览?

下一篇: 从Rle载体高效构建GRanges / IRanges