Handling huge simulations in R

I have written an R program that generates a random vector of length 1 million. I need to run this simulation 1 million times. Out of the 1 million simulated vectors, I will use 50K (chosen in some random manner) as samples, so the sample is a 50K x 1M matrix. Is there a way to deal with this in R?

There are a few problems, and the solutions I have found so far are not great.

First, R cannot store such a huge matrix on my machine; it exceeds the available RAM. I looked into packages like bigmemory and ffbase that use hard disk space instead, but data of this size can run into terabytes, and I only have 200 GB of hard disk available.
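For scale, a quick back-of-the-envelope estimate (assuming double-precision values at 8 bytes each) shows why neither the full result nor even the 50K sample fits on that disk:

```r
# Storage needed for dense double-precision matrices (8 bytes per cell)
n_sims   <- 1e6   # number of simulations
vec_len  <- 1e6   # length of each simulated vector
n_sample <- 5e4   # number of retained vectors

n_sims   * vec_len * 8 / 1024^4   # full 1M x 1M matrix: ~7.3 TB
n_sample * vec_len * 8 / 1024^3   # 50K x 1M sample: ~373 GB, more than the 200 GB disk
```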

Even if storing it were possible, there is the problem of running time: the code might take more than 100 hours to run!

Can anyone please suggest a way out? Thanks.


This answer really stands as something in between a comment and an answer. The easy way out of your dilemma is to not work with such massive data sets. You can most likely take a reasonably-sized representative subset of that data (say, requiring no more than a few hundred MB) and train your model that way.

If you have to use the model in production on actual data sets with millions of observations, then the problem would no longer be related to R.
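A minimal sketch of that idea, where the sizes and the object name are assumptions chosen purely to illustrate a subset that stays in the hundreds of MB:

```r
set.seed(42)

n_keep  <- 500    # e.g. keep 500 simulations instead of 50K
vec_len <- 1e5    # e.g. a 100K-component slice instead of the full 1M

# A dense subset this size fits comfortably in RAM
subset_sim <- matrix(rnorm(n_keep * vec_len), nrow = n_keep)
print(object.size(subset_sim), units = "MB")   # roughly 380 MB
```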


  • If possible, use sparse matrix techniques
  • If possible, try leveraging disk storage and chunking the object into parts (see the sketch after this list)
  • If possible, try Big Data tools such as H2O
  • Leverage multicore and HPC computing with pbdR, parallel, etc. (see the sketch after this list)
  • Consider using a spot instance of a Big Data / HPC cloud VPS on AWS, Azure, DigitalOcean, etc. Most offer distributions with R preinstalled, and with a high-RAM multicore instance you can finish quickly and cheaply
  • Use sampling and statistical solutions when possible
  • Consider doing some of your simulations or pre-simulation steps in a relational database, or in something like Spark + Scala; some of these have R integration nowadays
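To illustrate the chunking and multicore points above, here is a rough sketch of processing simulations in chunks and reducing each chunk to a small summary instead of ever materialising the full matrix. The sizes are deliberately scaled down, simulate_one() is a placeholder for whatever actually generates one vector, and mclapply only forks on Unix-alike systems (the mc.cores line falls back to 1 elsewhere):

```r
library(parallel)

n_sims     <- 1000    # scaled down for illustration; the real problem has 1e6
vec_len    <- 1e4     # scaled down; the real vectors have 1e6 components
chunk_size <- 100     # simulations handled per chunk

# Placeholder for the actual generator of one simulated vector
simulate_one <- function(i) rnorm(vec_len)

# Split simulation indices into chunks
chunks <- split(seq_len(n_sims), ceiling(seq_len(n_sims) / chunk_size))

# Each worker handles one chunk, keeps only a summary per simulation,
# and discards the full vectors, so no huge matrix is ever held in RAM
n_cores <- if (.Platform$OS.type == "unix") max(1, detectCores() - 1) else 1
chunk_summaries <- mclapply(chunks, function(idx) {
  vapply(idx, function(i) mean(simulate_one(i)), numeric(1))
}, mc.cores = n_cores)

results <- unlist(chunk_summaries)
length(results)   # one summary value per simulation
```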