Building a very large cosine matrix

I need to build a cosine matrix (i.e., a matrix of the cosine similarity between every pair of vectors) for a vector set with 89,000 vectors of length 500, yielding a final 89,000 x 89,000 matrix. My current approach seems to be very inefficient, leading to very long processing times (e.g., a set of 52,000 vectors of length 500 takes ~36 hours to build a 52,000 x 52,000 matrix).
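To make the computation concrete, here is the pairwise operation at toy scale, assuming cosine() is the two-vector form from the lsa package (the 10 x 500 matrix is made up for illustration):

library(lsa)

vecs <- matrix(rnorm(10 * 500), nrow = 10)  # 10 toy vectors of length 500
sim  <- matrix(0, 10, 10)
for (i in 1:10) {
    for (j in 1:10) {
        sim[i, j] <- cosine(vecs[i, ], vecs[j, ])  # cosine similarity of rows i and j
    }
}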

My current solution uses R version 3.0.1 (2013-05-16), running on 64-bit Ubuntu 13.10 on an Intel Core i7-4960X CPU @ 3.60GHz x 12 platform with 64GB RAM. Despite using a 64-bit system, I still run into vector-length errors thrown by native sub-functions in R (e.g., Error: ... Too many indices (>2^31-1) for extraction), and there does not seem to be a fix for that problem. My current solution therefore uses big.matrix objects from the bigmemory package. I am also making use of the doParallel package to utilize all 12 processor cores on my workstation.
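To make the scale concrete (my own back-of-the-envelope arithmetic): the result has nearly 8 billion cells, which exceeds the 2^31 - 1 single-index limit and would need roughly 59 GiB as an ordinary dense in-memory matrix:

n <- 89095
n^2             # ~7.94e9 cells, well above 2^31 - 1 (~2.15e9)
n^2 * 8 / 2^30  # ~59 GiB of doubles for a plain dense matrix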

This is the code I am currently using:

library(bigmemory)
library(doParallel)
library(lsa)    # provides the cosine() function

registerDoParallel(cores = 12)

setSize <- nrow(vectors_gw2014_FREQ_csMns) #i.e. =89,095

COSmatrix <- filebacked.big.matrix(
        #set dimensions and element value type
        setSize, setSize, init=0,
        type="double",
        backingpath = './COSmatrices',
        backingfile    = "cosMAT_gw2014_VARppmi.bak",
        descriptorfile = "cosMAT_gw2014_VARppmi.dsc"
        )

#initialize progress bar (updates from parallel workers may not reach the console)
pb <- txtProgressBar(min = 0, max = setSize, style = 3)
feErr <- foreach(i=1:setSize) %dopar% {
    #re-attach the file-backed matrix inside each worker
    COSmatrix <- attach.big.matrix("./COSmatrices/cosMAT_gw2014_VARppmi.dsc")
    setTxtProgressBar(pb, i)
    #fill the lower triangle, mirroring each value into the upper triangle
    for (j in 1:setSize)
    {
        if (j < i)
        {
            COSmatrix[i,j] <- cosine(   as.vector(vectors_gw2014_FREQ_csMns[i,],mode="numeric"),
                                        as.vector(vectors_gw2014_FREQ_csMns[j,],mode="numeric") )

            COSmatrix[j,i] <- COSmatrix[i,j]

        }
        else break
    }#FOR j
}#FOREACH DOPAR i
close(pb)

I suspect that the main problem with my code (i.e., the source of the excessive processing time) is the call to re-attach the big.matrix object in each iteration of the main foreach loop:

COSmatrix <- attach.big.matrix("./COSmatrices/cosMAT_gw2014_VARppmi.dsc")

However, this seems to be necessary in order to access a big.matrix object inside a foreach loop (i.e., the parallel construct from the doParallel package); without this line in the main loop, the COSmatrix object is inaccessible to the workers (see Using big.matrix in foreach loops).
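For what it's worth, one variant I have been considering (a sketch only, not benchmarked at this scale): bigmemory's describe() returns a serializable descriptor object, so each worker could attach once per chunk of rows instead of once per row. The chunking via parallel::splitIndices() and the NULL return value are my own choices here:

library(bigmemory)
library(doParallel)
library(lsa)

registerDoParallel(cores = 12)

desc   <- describe(COSmatrix)                  # serializable descriptor for the file-backed matrix
chunks <- parallel::splitIndices(setSize, 12)  # 12 blocks of row indices

feErr <- foreach(rows = chunks) %dopar% {
    m <- attach.big.matrix(desc)               # one attach per chunk, not one per row
    for (i in rows) {
        for (j in seq_len(i - 1)) {            # lower triangle only
            m[i, j] <- cosine(as.vector(vectors_gw2014_FREQ_csMns[i, ], mode = "numeric"),
                              as.vector(vectors_gw2014_FREQ_csMns[j, ], mode = "numeric"))
            m[j, i] <- m[i, j]
        }
    }
    NULL                                       # avoid shipping large results back to the master
}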

I am looking for any and all suggestions for streamlining this process and cutting the processing time down from days to hours. This means I am open to other approaches, either within R (e.g., alternatives to the bigmemory package) or with a completely different toolset (e.g., Python or C++ code). Please bear in mind that many (most?) of the commonly used R functions will not work with matrices of this size; I have explored many promising avenues only to run into the long-vectors 32/64-bit limitation (i.e., Error: ... Too many indices (>2^31-1) for extraction; see Max Length for a Vector in R).
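One direction I have been toying with (a sketch, not something I have benchmarked at full scale): the cosine of two vectors is just the dot product of the L2-normalized vectors, so the whole matrix could in principle be built from block-wise matrix products rather than billions of individual cosine() calls. The blockSize value below is arbitrary, and V is a dense copy of my vector set:

library(bigmemory)

V <- as.matrix(vectors_gw2014_FREQ_csMns)
V <- V / sqrt(rowSums(V^2))        # L2-normalize rows; cosine then reduces to a dot product

blockSize <- 1000                  # tuning parameter, chosen arbitrarily
starts    <- seq(1, setSize, by = blockSize)

for (bi in starts) {
    ri <- bi:min(bi + blockSize - 1, setSize)
    for (bj in starts) {
        rj <- bj:min(bj + blockSize - 1, setSize)
        # one BLAS-backed product fills an entire tile of the big.matrix
        COSmatrix[ri, rj] <- tcrossprod(V[ri, , drop = FALSE], V[rj, , drop = FALSE])
    }
}

Each tile is a single dense matrix product, so the work would move from R-level looping into compiled linear algebra; exploiting symmetry (computing only tiles with bj <= bi and mirroring them) should roughly halve the work again.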

Cheers!
