After a certain point, my R program suddenly slows to a crawl. Why?
I'm teaching myself R. The details of the program may not matter, but it constructs a set of nodes (with eight parameters each, all integers) and edges (four integer parameters each). Initially, I tried storing these as lists of lists. However, when I tried to construct a very large set of nodes, the program went smoothly until there were about 30,000 nodes and 64,000 edges, and then slowed very abruptly to a crawl.
I thought the problem might involve available memory in some way (although the lists took up 19.3MB and 39.5MB respectively). I increased the paging file size from 8097 to 12145 MB and set memory.limit at 12145, with no effect. I also tried storing the data in different structures: in a matrix of integers, in a data.frame, and in a big.matrix (from the package "bigmemory"). Regardless of the data structure, the program slows to a crawl at the same point. It doesn't stop completely-- it just slows down by an order of magnitude or two.
Could someone more experienced than I am explain briefly why slowdowns like this happen? Is there more than one possible reason that I should be looking into? In particular, I'm puzzled by the sudden change in speed. The code seems to run at constant speed until it hits the "wall" I've described above.
I've combed through several other posts on Stackoverflow where people complain about programs slowing down, but those all seem to involve memory issues, which (I think???) I've ruled out.
I've posted the code below, but it's long and convoluted, probably too long for anyone to riddle through. Here are some possibly salient facts about it:
Here's the code itself, in case the details are important. This is the big.matrix version, but all of the other versions have the same basic structure. It's a weird implementation of Ukkonen's algorithm for building a suffix tree. I apologize for posting such a long chunk of code.
library(bigmemory)
clear <- function (){
n <<- big.matrix(150000,8, type = "integer", init = 0L)
colnames(n) <<- c("A", "C", "G", "T", "$", "root", "link", "distance")
e <<- big.matrix(200000,4, type = "integer", init = 0L)
colnames(e) <<- c("from","to","start","stop")
}
AT_NODE <- function(){
temp1 <- n[a_node, u(curr_idx)]
if (temp1==0){
return (INSERT_BRANCH)
}
temp2 <- e[temp1,"stop"] - e[temp1,"start"] + 1L
if (min_dist >= temp2){
if (length(e[temp1,"to"]) == 0){
a_node <<- 1L
curr_idx <<- curr_suffix <<- curr_suffix + 1L
return (AT_NODE)
} else {
min_dist <<- min_dist - temp2
curr_idx <<- curr_idx + temp2
a_node <<- e[temp1,"to"]
return (AT_NODE)
}
} else {
a_pos <<- a_pos + min_dist
curr_idx <<- curr_idx + min_dist
a_edge <<- temp1
return (STEP_FORWARD)
}
}
INSERT_BRANCH <- function(){
e[e_ptr+1,] <<- c(a_node, 0L, curr_idx, as.integer(nchar(strg)))
e_ptr <<- e_ptr+1L
n[a_node, u(curr_idx)] <<- e_ptr
return (SUFFIX_JUMP)
}
STEP_FORWARD <- function(){
if (u(curr_idx) == u(e[a_edge,"start"] + a_pos - 1L)){
a_pos <<- a_pos + 1L
curr_idx <<- curr_idx + 1L
if (a_pos > (e[a_edge,"stop"] - e[a_edge,"start"] + 1L)){
a_pos <<- 1L
if (curr_idx > nchar(strg)){
a_node <<- 1L
curr_idx <<- curr_suffix <<- curr_suffix + 1L
return (AT_NODE)
} else {
a_node <<- e[a_edge,"to"]
min_dist <<- max (0L, (min_dist - (e[a_edge,"stop"] - e[a_edge,"start"] + 1L)))
return (AT_NODE)
}
} else {
return (STEP_FORWARD)
}
} else {
return (ADD_NODE)
}
}
ADD_NODE <- function(){
min_dist <<- n[e[a_edge,"from"],"distance"] + a_pos - 1L
n_ptr <<- n_ptr + 1L
n[n_ptr,u(curr_idx)] <<- e_ptr + 1L
n[n_ptr,u(e[a_edge, "start"] + a_pos - 1L)] <<- e_ptr + 2L
n[n_ptr, "root"] <<- a_edge
n[n_ptr, "distance"] <<- min_dist
e[e_ptr + 1L,] <<- c(n_ptr, 0L, curr_idx, as.integer(nchar(strg)))
e[e_ptr + 2L,] <<- c(n_ptr, e[a_edge,"to"], e[a_edge, "start"] + a_pos - 1L, e[a_edge,"stop"])
e[a_edge,"to"] <<- n_ptr
e[a_edge,"stop"] <<- e[a_edge,"start"] + a_pos - 2L
e_ptr <<- e_ptr + 2L
if (a_node == 1){
min_dist <<- max(0L, min_dist - 1L)
}
min_dist <<- max(0L, min(min_dist, (curr_idx - curr_suffix - n[a_node,"distance"])))
a_node <<- n_ptr
if (e[n[a_node,"root"],"from"] == 1 & (e[n[a_node,"root"],"stop"] - e[n[a_node,"root"],"start"]) == 0){
n[a_node,"link"] <<- 1L
suffix_link <<- 0L
}
return (SUFFIX_JUMP)
}
SUFFIX_JUMP <- function(){
if (suffix_link != 0L){
n[suffix_link,"link"] <<- a_node
}
if (a_node > 1){
suffix_link <<- a_node
}
a_pos <<- 1L
curr_idx <<- curr_suffix <<- curr_suffix + 1L
temp1 <- n[orig_node,"link"]
if (temp1 > 1){
a_node <<- orig_node <<- temp1
curr_idx <<- curr_idx + n[a_node,"distance"]
return (AT_NODE)
}
min_dist <<- max(0L, min_dist -1L)
orig_node <<- a_node <<- 1L
#print ("headed to root")
return (AT_NODE)
}
Ukkonen <- function (strg){
strg <<- paste0(c(strg, "$"), collapse = "")
action <- AT_NODE
while (curr_suffix < (nchar(strg)+2)){
temp1 <- do.call(action, list())
action <- temp1
}
}
strg <- <<a string of 89,000 nucleotides>>
a_edge <- min_dist <- suffix_link <- e_ptr <- 0L
curr_suffix <- curr_idx <- a_pos <- a_node <- n_ptr <- orig_node <- 1L
clear()
Ukkonen(strg)
链接地址: http://www.djcxy.com/p/31868.html