After a certain point, my R program suddenly slows to a crawl. Why?

2018-06-11 02:13:17

I'm teaching myself R. The details of the program may not matter, but it constructs a set of nodes (with eight parameters each, all integers) and edges (four integer parameters each). Initially, I tried storing these as lists of lists. However, when I tried to construct a very large set of nodes, the program went smoothly until there were about 30,000 nodes and 64,000 edges, and then slowed very abruptly to a crawl.

I thought the problem might involve available memory in some way (although the lists took up 19.3MB and 39.5MB respectively). I increased the paging file size from 8097 to 12145 MB and set memory.limit at 12145, with no effect. I also tried storing the data in different structures: in a matrix of integers, in a data.frame, and in a big.matrix (from the package "bigmemory"). Regardless of the data structure, the program slows to a crawl at the same point. It doesn't stop completely-- it just slows down by an order of magnitude or two.

Could someone more experienced than I am explain briefly why slowdowns like this happen? Is there more than one possible reason that I should be looking into? In particular, I'm puzzled by the sudden change in speed. The code seems to run at constant speed until it hits the "wall" I've described above.

I've combed through several other posts on Stackoverflow where people complain about programs slowing down, but those all seem to involve memory issues, which (I think???) I've ruled out.

I've posted the code below, but it's long and convoluted, probably too long for anyone to riddle through. Here are some possibly salient facts about it:

The algorithm is O(n), or nearly so.

There's no actual recursion. The core of the program is a while loop that calls one of five functions; that function returns the identity of the function that needs to be called next, and the while loop then calls that function.

Each of the five functions can: add one or two edges or nodes; alter the parameters for one or two edges or nodes; and/or change a handful of global parameters that govern what the algorithm does.

Here's the code itself, in case the details are important. This is the big.matrix version, but all of the other versions have the same basic structure. It's a weird implementation of Ukkonen's algorithm for building a suffix tree. I apologize for posting such a long chunk of code.

library(bigmemory)


clear <- function (){
  n <<- big.matrix(150000,8, type = "integer", init = 0L)
  colnames(n) <<- c("A", "C", "G", "T", "$", "root", "link", "distance")
  e <<- big.matrix(200000,4, type = "integer", init = 0L)
  colnames(e) <<- c("from","to","start","stop")
}

AT_NODE <- function(){
  temp1 <- n[a_node, u(curr_idx)]
  if (temp1==0){
    return (INSERT_BRANCH)
  }
  temp2 <- e[temp1,"stop"] - e[temp1,"start"] + 1L
  if (min_dist >= temp2){
    if (length(e[temp1,"to"]) == 0){
      a_node <<- 1L
      curr_idx <<- curr_suffix <<- curr_suffix + 1L
      return (AT_NODE)
    } else {
    min_dist <<- min_dist - temp2
    curr_idx <<- curr_idx + temp2
    a_node <<- e[temp1,"to"]
    return (AT_NODE)
    }
  } else {
    a_pos <<- a_pos + min_dist
    curr_idx <<- curr_idx + min_dist
    a_edge <<- temp1
    return (STEP_FORWARD)
  }
}

INSERT_BRANCH <- function(){
  e[e_ptr+1,] <<- c(a_node, 0L, curr_idx, as.integer(nchar(strg)))
  e_ptr <<- e_ptr+1L
  n[a_node, u(curr_idx)] <<- e_ptr
  return (SUFFIX_JUMP)
}

STEP_FORWARD <- function(){
  if (u(curr_idx) == u(e[a_edge,"start"] + a_pos - 1L)){
   a_pos <<- a_pos + 1L
    curr_idx <<- curr_idx + 1L
    if (a_pos > (e[a_edge,"stop"] - e[a_edge,"start"] + 1L)){
      a_pos <<- 1L
      if (curr_idx > nchar(strg)){
        a_node <<- 1L
        curr_idx <<- curr_suffix <<- curr_suffix + 1L
        return (AT_NODE)
      } else {
        a_node <<- e[a_edge,"to"]
        min_dist <<- max (0L, (min_dist - (e[a_edge,"stop"] - e[a_edge,"start"] + 1L)))
        return (AT_NODE)
      }
    } else {
      return (STEP_FORWARD)
    }
  } else {
    return (ADD_NODE)
  }
}

ADD_NODE <- function(){
  min_dist <<- n[e[a_edge,"from"],"distance"] + a_pos - 1L
  n_ptr <<- n_ptr + 1L
  n[n_ptr,u(curr_idx)] <<- e_ptr + 1L
  n[n_ptr,u(e[a_edge, "start"] + a_pos - 1L)] <<- e_ptr + 2L
  n[n_ptr, "root"] <<- a_edge
  n[n_ptr, "distance"] <<- min_dist
  e[e_ptr + 1L,] <<- c(n_ptr, 0L, curr_idx, as.integer(nchar(strg)))
  e[e_ptr + 2L,] <<- c(n_ptr, e[a_edge,"to"], e[a_edge, "start"] + a_pos - 1L, e[a_edge,"stop"])
  e[a_edge,"to"] <<- n_ptr
  e[a_edge,"stop"] <<- e[a_edge,"start"] + a_pos - 2L
  e_ptr <<- e_ptr + 2L

  if (a_node == 1){
    min_dist <<- max(0L, min_dist - 1L)
  }
  min_dist <<- max(0L, min(min_dist, (curr_idx - curr_suffix - n[a_node,"distance"])))
  a_node <<- n_ptr
  if (e[n[a_node,"root"],"from"] == 1 & (e[n[a_node,"root"],"stop"] - e[n[a_node,"root"],"start"]) == 0){
    n[a_node,"link"] <<- 1L
    suffix_link <<- 0L
  }
  return (SUFFIX_JUMP)
}

SUFFIX_JUMP <- function(){
  if (suffix_link != 0L){
    n[suffix_link,"link"] <<- a_node
  }
  if (a_node > 1){
    suffix_link <<- a_node
  }
  a_pos <<- 1L
  curr_idx <<- curr_suffix <<- curr_suffix + 1L
  temp1 <- n[orig_node,"link"]
  if (temp1 > 1){
    a_node <<- orig_node <<- temp1
    curr_idx <<- curr_idx + n[a_node,"distance"]
    return (AT_NODE)
  }
  min_dist <<- max(0L, min_dist -1L)
  orig_node <<- a_node <<- 1L
  #print ("headed to root")
  return (AT_NODE)
}

Ukkonen <- function (strg){
  strg <<- paste0(c(strg, "$"), collapse = "")
  action <- AT_NODE

  while (curr_suffix < (nchar(strg)+2)){
    temp1 <- do.call(action, list())
    action <- temp1
  }
}

strg <- <<a string of 89,000 nucleotides>>
a_edge <- min_dist <- suffix_link <- e_ptr <- 0L
curr_suffix <- curr_idx <- a_pos <- a_node <- n_ptr <- orig_node <- 1L
clear()
Ukkonen(strg)

链接地址: http://www.djcxy.com/p/31868.html

上一篇: 如何处理R igraph中的非常大的双峰边界列表？

下一篇: 在某一点之后，我的R程序突然变慢了。为什么？