Fast(er) way of indexing matrix in R

2018-06-05 14:29:21

Foremost, I am looking for a fast(er) way of subsetting/indexing a matrix many, many times over:

for (i in 1:99000) {
  subset.data <- data[index[, i], ]
}

Background:
I'm implementing a sequential testing procedure involving the bootstrap in R. Wanting to replicate some simulation results, I came upon this bottleneck where lots of indexing needs to be done. For implementation of the block-bootstrap I created an index matrix with which I subset the original data matrix to draw resamples of the data.

# The basic setup

B <- 1000 # no. of bootstrap replications
n <- 250  # no. of observations
m <- 100  # no. of models/data series

# Create index matrix with B columns and n rows.
# Each column represents a resampling of the data.
# (actually block resamples, but doesn't matter here).

boot.index <- matrix(sample(1:n, n * B, replace=T), nrow=n, ncol=B)

# Make matrix with m data series of length n.

sample.data <- matrix(rnorm(n * m), nrow=n, ncol=m)

subsetMatrix <- function(data, index) { # fn definition for timing
  subset.data <- data[index, ]
  return(subset.data)
}

# check how long it takes.

Rprof("subsetMatrix.out")
for (i in 1:(m - 1)) { 
  for (b in 1:B) {  # B * (m - 1) = 1000 * 99 = 99000
    boot.data <- subsetMatrix(sample.data, boot.index[, b])
    # do some other stuff
  }
  # do some more stuff
}
Rprof()
summaryRprof("subsetMatrix.out")

# > summaryRprof("subsetMatrix.out")
# $by.self
#              self.time self.pct total.time total.pct
# subsetMatrix      9.96      100       9.96       100

# In the actual application:
#########
# > summaryRprof("seq_testing.out")
# $by.self
#              self.time self.pct total.time total.pct
# subsetMatrix       6.78    53.98       6.78     53.98
# colMeans           1.98    15.76       2.20     17.52
# makeIndex          1.08     8.60       2.12     16.88
# makeStats          0.66     5.25       9.66     76.91
# runif              0.60     4.78       0.72      5.73
# apply              0.30     2.39       0.42      3.34
# is.data.frame      0.22     1.75       0.22      1.75
# ceiling            0.18     1.43       0.18      1.43
# aperm.default      0.14     1.11       0.14      1.11
# array              0.12     0.96       0.12      0.96
# estimateMCS        0.10     0.80      12.56    100.00
# as.vector          0.10     0.80       0.10      0.80
# matrix             0.08     0.64       0.08      0.64
# lapply             0.06     0.48       0.06      0.48
# /                  0.04     0.32       0.04      0.32
# :                  0.04     0.32       0.04      0.32
# rowSums            0.04     0.32       0.04      0.32
# -                  0.02     0.16       0.02      0.16
# >                  0.02     0.16       0.02      0.16
#
# $by.total
#              total.time total.pct self.time self.pct
# estimateMCS        12.56    100.00      0.10     0.80
# makeStats           9.66     76.91      0.66     5.25
# subsetMatrix        6.78     53.98      6.78    53.98
# colMeans            2.20     17.52      1.98    15.76
# makeIndex           2.12     16.88      1.08     8.60
# runif               0.72      5.73      0.60     4.78
# doTest              0.68      5.41      0.00     0.00
# apply               0.42      3.34      0.30     2.39
# aperm               0.26      2.07      0.00     0.00
# is.data.frame       0.22      1.75      0.22     1.75
# sweep               0.20      1.59      0.00     0.00
# ceiling             0.18      1.43      0.18     1.43
# aperm.default       0.14      1.11      0.14     1.11
# array               0.12      0.96      0.12     0.96
# as.vector           0.10      0.80      0.10     0.80
# matrix              0.08      0.64      0.08     0.64
# lapply              0.06      0.48      0.06     0.48
# unlist              0.06      0.48      0.00     0.00
# /                   0.04      0.32      0.04     0.32
# :                   0.04      0.32      0.04     0.32
# rowSums             0.04      0.32      0.04     0.32
# -                   0.02      0.16      0.02     0.16
# >                   0.02      0.16      0.02     0.16
# mean                0.02      0.16      0.00     0.00
#
# $sample.interval
# [1] 0.02
#
# $sampling.time
# [1] 12.56'

Doing the sequential testing procedure once takes about 10 seconds. Using this in simulations with 2500 replications and several parameter constellations, it would take something like 40 days. Using parallel processing and better CPU power it's possible to do faster, but still not very pleasing :/

Is there a better way to resample the data / get rid of the loop?

Can apply, Vectorize, replicate etc. come in anywhere?

Would it make sense to implement the subsetting in C (eg manipulate some pointers)?

Even though every single step is already done incredibly fast by R, it's just not quite fast enough.
I'd be very glad indeed for any kind of response/help/advice!

related Qs:
- Fast matrix subsetting via '[': by rows, by columns or doesn't matter?
- fast function for generating bootstrap samples in matrix forms in R
- random sampling - matrix

from there

mapply(function(row) return(sample.data[row,]), row = boot.index)
replicate(B, apply(sample.data, 2, sample, replace = TRUE))

didn't really do it for me.

I rewrote makeStats and makeIndex as they were two of the biggest bottlenecks:

makeStats <- function(data, index) {

  data.mean <- colMeans(data)
  m <- nrow(data)
  n <- ncol(index)
  tabs <- lapply(1L:n, function(j)tabulate(index[, j], nbins = m))
  weights <- matrix(unlist(tabs), m, n) * (1 / nrow(index))
  boot.data.mean <- t(data) %*% weights - data.mean

  return(list(data.mean = data.mean,
              boot.data.mean = boot.data.mean))
}

makeIndex <- function(B, blocks){

  n <- ncol(blocks)
  l <- nrow(blocks)
  z <- ceiling(n/l)
  start.points <- sample.int(n, z * B, replace = TRUE)
  index <- blocks[, start.points]
  keep <- c(rep(TRUE, n), rep(FALSE, z*l - n))
  boot.index <- matrix(as.vector(index)[keep],
                       nrow = n, ncol = B)

  return(boot.index)
}

This brought down the computation times from 28 to 6 seconds on my machine. I bet there are other parts of the code that can be improved (including my use of lapply/tabulate above.)

链接地址: http://www.djcxy.com/p/17698.html

上一篇: 作曲家不加载我自己的PSR

下一篇: R中索引矩阵的快速（呃）方法