Using AWS for parallel processing with R
I want to take a shot at the Kaggle Dunnhumby challenge by building a model for each customer. I want to split the data into ten groups and use Amazon web-services (AWS) to build models using R on the ten groups in parallel. Some relevant links I have come across are:
What I don't understand is:
I would be very grateful if you could share suggestions and hints to point me in the right direction.
PS I am using the free usage account on AWS but it was very difficult to install R from source on the Amazon Linux AMIs (lots of errors due to missing headers, libraries and other dependencies).
You can build up everything manually at AWS. You have to build your own amazon computer cluster with several instances. There is a nice tutorial video available at the Amazon website: http://www.youtube.com/watch?v=YfCgK1bmCjw
But it will take you several hours to get everything running:
The segue package is nice but you will definitely get data communication problems!
The simples solution is cloudnumbers.com (http://www.cloudnumbers.com). This platform provides you with easy access to computer clusters in the cloud. You can test 5 hours for free with a small computer cluster in the cloud! Check the slides from the useR conference: http://cloudnumbers.com/hpc-news-from-the-user2011-conference
I'm not sure I can answer the question about which method to use, but I can explain how I would think about the question. I'm the author of Segue so keep that bias in mind :)
A few questions I would answer BEFORE I started trying to figure out how to get AWS (or any other system) running:
Just glancing at the training data, it doesn't look that big (~280 MB). So this isn't really a "big data" problem. If your models take a long time to create, it might be a "big CPU" problem, which Segue may, or may not, be a good tool to help you solve.
In answer to your specific question about how to get the data onto AWS, Segue does this by serializing the list object you provide to the emrlapply() command, uploading the serialized object to S3, then using the Elastic Map Reduce service to stream the object through Hadoop. But as a user of Segue you don't need to know that. You just need to call emrlapply() and pass it your list data (probably a list where each element is a matrix or data frame of a single shopper's data) and a function (one you write to fit the model you choose) and Segue takes care of the rest. But keep in mind that the very first thing Segue does when you call emrlapply() is to serialize (sometimes slowly) and upload your data to S3. So depending on the size of the data and the speed of your internet connection upload speeds, this can be slow. I take issues with Markus' assertion that you will "definitely get data communication problems". That's clearly FUD. I use Segue on stochastic simulations that send/receive 300MB/1GB with some regularity. But I tend to run these simulations from an AWS instance so I am sending and receiving from one AWS rack to another, which makes everything much faster.
If you're wanting to do some analysis on AWS and get your feet wet with R in the cloud, I recommend Drew Conway's AMI for Scientific Computing. Using his AMI will save you from having to install/build much. To upload data to your running machine, once you set up your ssh certificates, you can use scp to upload files to your instance.
I like running RStudio on my Amazon instances. This will require setting up password access to your instance. There are a lot of resources around for helping with this.
链接地址: http://www.djcxy.com/p/53420.html上一篇: R连接到EC2实例进行并行处理
下一篇: 使用AWS与R并行处理