Building a very large cosine matrix
I need to build a cosine matrix (i.e. a matrix of cosine similarities between every pair of vectors) for a set of 89,000 vectors of length 500, which gives a final 89,000 x 89,000 matrix. My current approach is very inefficient and leads to very long processing times (e.g. a set of 52,000 vectors of length 500 takes ~36 hours to build a 52,000 x 52,000 matrix).
My current solution uses R version 3.0.1 (2013-05-16), running on 64-bit Ubuntu 13.10 on an Intel Core i7 4960X CPU @ 3.60GHz x 12 with 64GB RAM. Despite using a 64-bit system, I still run into vector length errors thrown by native sub-functions in R (e.g. Error: ... Too many indices (>2^31-1) for extraction), and there does not seem to be a fix for that problem. As such, my current solution uses big.matrix objects from the bigmemory package. I also use the doParallel package to utilize all 12 processor cores on my workstation.
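As a small illustration of the definition above, here is a minimal sketch on toy data (not the code used for the real set). The names X, cosine_pair, cos_loop and cos_mat are hypothetical and exist only for this example. It builds the matrix once with one cosine call per pair, and once by scaling every row to unit length and taking a single matrix product, then checks that the two results agree.

```r
# Toy data: 4 vectors of length 3 (a stand-in for the real 89,095 x 500 set)
set.seed(1)
X <- matrix(runif(12), nrow = 4, ncol = 3)

# Cosine of two vectors: dot product divided by the product of their norms
cosine_pair <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Pairwise definition: one cosine call per (i, j) combination
n <- nrow(X)
cos_loop <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    cos_loop[i, j] <- cosine_pair(X[i, ], X[j, ])
  }
}

# Same matrix in one step: scale each row to unit length, then multiply
X_unit <- X / sqrt(rowSums(X^2))
cos_mat <- tcrossprod(X_unit)   # X_unit %*% t(X_unit)

# Compare the two results (within floating-point tolerance)
all.equal(cos_loop, cos_mat)
```

Whether the single-product form is usable at full size depends on how the output is stored; this sketch only shows that both forms describe the same matrix.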
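To put the sizes in perspective, a short sketch of the arithmetic, using the 89,095 row count from the code and assuming 8 bytes per double element. The full matrix has 89,095^2 = 7,937,919,025 cells, which is about 3.7 times the 2^31-1 limit quoted in the error, and needs roughly 63.5 GB as doubles, which is comparable to the 64GB of RAM on this machine.

```r
n <- 89095                        # number of vectors (rows in the input)
n_cells <- n^2                    # elements in the full n x n matrix
int_max <- .Machine$integer.max   # 2^31 - 1, the limit quoted in the error

n_cells                # total number of cells
n_cells > int_max      # whether the cell count exceeds a 32-bit index
n_cells / int_max      # how many times over the limit
n_cells * 8 / 1e9      # size in GB, assuming 8 bytes per double
```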
This is the code I am currently using:
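library(bigmemory)   # filebacked.big.matrix(), attach.big.matrix()
library(doParallel)  # also attaches foreach; a parallel backend is assumed to be registered
# cosine() is assumed to come from a package such as lsa; its source is not shown here

setSize <- nrow(vectors_gw2014_FREQ_csMns)  # i.e. = 89,095
COSmatrix <- filebacked.big.matrix(
  # set dimensions and element value type
  setSize, setSize, init = 0,
  type = "double",
  backingpath = './COSmatrices',
  backingfile = "cosMAT_gw2014_VARppmi.bak",
  descriptorfile = "cosMAT_gw2014_VARppmi.dsc"
)
# initialize progress bar
pb <- txtProgressBar(min = 0, max = setSize, style = 3)
feErr <- foreach(i = 1:setSize) %dopar% {
  # re-attach the file-backed matrix inside each parallel iteration
  COSmatrix <- attach.big.matrix("./COSmatrices/cosMAT_gw2014_FREQ_csMns.dsc")
  setTxtProgressBar(pb, i)
  for (j in 1:setSize) {
    if (j < i) {
      # fill the lower triangle, then mirror the value into the upper triangle
      COSmatrix[i, j] <- cosine(as.vector(vectors_gw2014_FREQ_csMns[i, ], mode = "numeric"),
                                as.vector(vectors_gw2014_FREQ_csMns[j, ], mode = "numeric"))
      COSmatrix[j, i] <- COSmatrix[i, j]
    }
    else break
  } # FOR j
} # FOREACH DOPAR i
close(pb)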
I suspect that the main problem with my code (i.e. the cause of the excessive processing time) is the call to re-attach the big.matrix object in each iteration of the main foreach loop:
COSmatrix <- attach.big.matrix("./COSmatrices/cosMAT_gw2014_FREQ_csMns.dsc")
However, this seems to be necessary in order to access a big.matrix object inside a foreach loop (i.e. the parallel processing construct used with the doParallel package); without this line in the main loop, the COSmatrix object is inaccessible (see Using big.matrix in foreach loops).
I am looking for any and all suggestions for streamlining this process and cutting the processing time down from days to hours. I am open to other approaches, either within R (i.e. using alternatives to the bigmemory package) or with a completely different toolset (i.e. Python or C++ code). Please bear in mind that many (most?) of the commonly used R functions will not work with matrices of this size; I have explored many promising avenues only to run into the long-vector 32/64-bit limitation (i.e. Error: ... Too many indices (>2^31-1) for extraction; see Max Length for a Vector in R).
Cheers!