Monkey patching R language base functions with big data functions for speed
It all started with an R package I badly needed to use ('nlt'), which has two other (quite large) package dependencies ('adlift', 'EbayesThresh'). I needed it to analyze a data sample of about 4000 points.
The algorithms create many 'hidden' vectors, so even though at first glance you'd think you have enough memory to load the data sample and process it, things turn sour fast. At this point I should mention that I have both Ubuntu x64 and Windows x64 at my disposal, each with 4 GB of RAM.
Out of sheer curiosity and masochism, I guess, I decided to give it a try on an Amazon EC2 instance. I ended up trying several of them and stopped at the High-Memory Extra Large Instance (17.1 GB of memory, 6.5 ECUs), where once again I ran out of memory and Ubuntu killed my running function.
I ended up using the split-apply-combine approach with 'snowfall', 'foreach' and 'doSMP'. I chunked my data, processed each chunk and combined the results. Thank heavens lapply and sfLapply exist. The sample was analyzed in under 7 minutes on my laptop.
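For the record, a minimal sketch of that split-apply-combine setup with 'snowfall', where analyse_chunk() is a hypothetical stand-in for the real nlt-based analysis and the chunk size is arbitrary:

library(snowfall)

analyse_chunk <- function(v) {
  # placeholder: the real code would run the nlt/adlift analysis on this chunk
  summary(v)
}

x <- rnorm(4000)                                  # stand-in for the 4000-point sample
chunks <- split(x, ceiling(seq_along(x) / 500))   # arbitrary 500-point chunks

sfInit(parallel = TRUE, cpus = 4)                 # adjust to the available cores
sfExport("analyse_chunk")                         # push the worker function to the slaves
results <- sfLapply(chunks, analyse_chunk)        # one result per chunk
sfStop()

combined <- do.call(rbind, results)               # combine the per-chunk results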
I guess I should be happy, but 7 minutes is still a lot, and I'd rather not jump to Amazon EC2 again unless there's really nothing else left that could shorten the run time.
I did some research; the 'bigmemory' and 'ff' packages for R seem to allow considerable speed-ups, especially if I use file-backed data.
The 'nlt' package only takes vectors as input, and 'bigmemory', for instance, has its own special data type, the big.matrix. Even if I could magically feed big.matrix objects to the 'nlt' package, that still leaves the many new vector allocations with standard R functions that are hard-coded into the package and its dependencies.
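To illustrate the mismatch, here is a minimal file-backed 'bigmemory' sketch (file names and dimensions are arbitrary) and the plain-vector extraction 'nlt' would still force on me:

library(bigmemory)

bm <- filebacked.big.matrix(nrow = 4000, ncol = 1, type = "double",
                            backingfile = "sample.bin",
                            descriptorfile = "sample.desc")
bm[, 1] <- rnorm(4000)   # stand-in for the real 4000-point sample

x <- bm[, 1]             # 'nlt' still wants a plain in-memory vector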
I keep thinking of aspect-oriented programming / monkey patching, and I managed to find what appears to be the only R package for such a thing, 'r-connect'.
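For what it's worth, base R's assignInNamespace() already allows a crude form of this by swapping a package-internal function for a wrapped one. A purely illustrative sketch (this is not the 'r-connect' API, and the replacement here just delegates to the original):

library(adlift)

orig_Amatdual <- getFromNamespace("Amatdual", "adlift")   # keep a handle on the original

patched_Amatdual <- function(...) {
  # a big-data-aware replacement would go here; this version only logs and delegates
  message("Amatdual called")
  orig_Amatdual(...)
}

assignInNamespace("Amatdual", patched_Amatdual, ns = "adlift")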
Now, as I see it, I have 2 main options:
Am I jumping the shark? Can anyone else propose another solution or share similar experiences?
Another option would be to profile those three packages' memory use, eliminate redundant data, and remove objects once they're no longer needed.
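If I go that route, base R's Rprof() with memory profiling seems like the place to start; in the sketch below the profiled call is just a throwaway stand-in for the real nlt run, and the output file name is arbitrary:

Rprof("nlt-mem.out", memory.profiling = TRUE)
dummy <- lapply(1:20, function(i) matrix(rnorm(1e5), ncol = 100))  # stand-in workload
Rprof(NULL)
summaryRprof("nlt-mem.out", memory = "both")   # per-call memory alongside timings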
UPDATE:
'nlt' isn't too complicated; it mostly wraps 'adlift' and 'EbayesThresh' functions, so I would take a look at those two packages. Take adlift/R/Amatdual.R for example: Adual and Hdual are initialized at the beginning of the Amatdual function, but they're never indexed in the function; they're completely re-created later.
Adual <- matrix(0, n - steps + 1, n - steps + 1)
Hdual <- matrix(0, n - steps, n - steps + 1)
...
Hdual <- cbind(diag(length(newpoints) - 1), lastcol)
...
Adual <- rbind(Hdual, as.row(Gdual))
There's no need for those two initial allocations.
'adlift' and 'nlt' also have several uses of apply that could be switched to row/col Means/Sums. I'm not sure how much this would help with memory usage, but it would be faster, i.e.:
apply(foo, 1, sum) # same as rowSums(foo)
apply(foo, 2, sum) # same as colSums(foo)
apply(foo, 1, mean) # same as rowMeans(foo)
apply(foo, 2, mean) # same as colMeans(foo)
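A quick check on a throwaway matrix (sizes are arbitrary) shows the speed difference and confirms the results match:

foo <- matrix(rnorm(1e6), nrow = 1000)
system.time(apply(foo, 1, sum))                          # slower: loops over rows in R
system.time(rowSums(foo))                                # faster: vectorized in C
stopifnot(all.equal(apply(foo, 1, sum), rowSums(foo)))   # same answer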