Best unit testing tool/method for a Hadoop MapReduce job
I am new here, but I need to know the best way to unit test programs written on Apache Hadoop. I know we can write JUnit-style test cases for the logic inside the map and reduce methods, and we can do the same for the other logic involved, but that doesn't guarantee the job is well tested or that it will work in the actual runtime environment.
I have read about MRUnit, but it too seems to be much like what I described above, only more mature. It still doesn't run a real MapReduce job, but a mocked one.
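To be concrete, this is the kind of MRUnit test I am talking about — a small sketch against a toy word-count mapper that I made up just for illustration:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mrunit.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {

  /** A toy word-count mapper, defined here only so the test has something to drive. */
  public static class WordCountMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      for (String word : value.toString().split("\\s+")) {
        output.collect(new Text(word), ONE);
      }
    }
  }

  @Test
  public void mapperEmitsOneCountPerWord() throws IOException {
    // MRUnit feeds the mapper one record and checks the emitted pairs,
    // all in-memory -- no real MapReduce job is run.
    new MapDriver<LongWritable, Text, Text, IntWritable>()
        .withMapper(new WordCountMapper())
        .withInput(new LongWritable(0), new Text("hadoop mapreduce"))
        .withOutput(new Text("hadoop"), new IntWritable(1))
        .withOutput(new Text("mapreduce"), new IntWritable(1))
        .runTest();
  }
}
```

This is exactly the "mocked" style I mean: it checks the map logic but never exercises a real job.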
Any help would be appreciated.
Thanks.
You certainly have other options. A bit of googling would have gotten you there yourself, but here, I've done it for you!
Here is the text, pasted from: http://blog.cloudera.com/blog/2009/07/advice-on-qa-testing-your-mapreduce-jobs/
Other than traditional JUnit and MRUnit, you have the following options:
Local Job Runner Testing – Running MR Jobs on a Single Machine in a Single JVM
Traditional unit tests and MRUnit should do a sufficient job of detecting bugs early, but neither tests your MR jobs with Hadoop itself. The local job runner lets you run Hadoop on a local machine, in one JVM, making MR jobs a little easier to debug when a job fails.
To enable the local job runner, set “mapred.job.tracker” to “local” and “fs.default.name” to “file:///some/local/path” (these are the default values).
Remember, there is no need to start any Hadoop daemons when using the local job runner. Running bin/hadoop will start a JVM and will run your job for you. Creating a new hadoop-local.xml file (or mapred-local.xml and hdfs-local.xml if you're using 0.20) probably makes sense. You can then use the --config parameter to tell bin/hadoop which configuration directory to use. If you'd rather avoid fiddling with configuration files, you can create a class that implements Tool and uses ToolRunner, and then run this class with bin/hadoop jar foo.jar com.example.Bar -D mapred.job.tracker=local -D fs.default.name=file:/// (args), where Bar is the Tool implementation.
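A minimal Tool implementation along those lines might look roughly like this — a sketch only, using the identity mapper/reducer so the job is self-contained (the class name Bar matches the command line above; everything else is illustrative and should be replaced with your own classes and paths):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Bar extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // getConf() already carries any generic -D overrides from the command
    // line, e.g. -D mapred.job.tracker=local -D fs.default.name=file:///
    JobConf job = new JobConf(getConf(), Bar.class);
    job.setJobName("bar");

    // Identity mapper/reducer just to keep the sketch runnable;
    // substitute your real map and reduce classes here.
    job.setMapperClass(IdentityMapper.class);
    job.setReducerClass(IdentityReducer.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    JobClient.runJob(job);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses the generic options (-D, -conf, ...) before run().
    System.exit(ToolRunner.run(new Configuration(), new Bar(), args));
  }
}
```

With that in place, the command shown above (with the -D options placed before the input/output arguments) runs the job entirely against the local filesystem.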
To start using the local job runner to test your MR jobs in Hadoop, create a new configuration directory that is local job runner enabled and invoke your job as you normally would, remembering to include the --config parameter, which points to a directory containing your local configuration files.
The -conf parameter also works in 0.18.3 and lets you specify your hadoop-local.xml file instead of specifying a directory with --config. Hadoop will run the job happily. The difficulty with this form of testing is verifying that the job ran correctly. Note: you'll have to ensure that input files are set up correctly and output directories don't exist before running the job.
Assuming you've managed to configure the local job runner and get a job running, you'll have to verify that your job completed correctly. Simply basing success on exit codes isn't quite good enough. At the very least, you'll want to verify that the output of your job is correct. You may also want to scan the output of bin/hadoop for exceptions. You should create a script or unit test that sets up preconditions, runs the job, diffs actual output and expected output, and scans for raised exceptions. This script or unit test can then exit with the appropriate status and output specific messages explaining how the job failed.
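For instance, such a verification test might look roughly like the sketch below. It reuses the hypothetical Bar Tool from the earlier sketch; the input text, directories, and the expected-output file are made-up placeholders:

```java
import static org.junit.Assert.assertEquals;

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.junit.Before;
import org.junit.Test;

public class BarLocalRunnerTest {

  private final Path inputDir = Paths.get("target", "test-input");
  private final Path outputDir = Paths.get("target", "test-output");

  @Before
  public void setUpPreconditions() throws Exception {
    // Preconditions: the input file exists, the output directory does not.
    deleteRecursively(outputDir.toFile());
    deleteRecursively(inputDir.toFile());
    Files.createDirectories(inputDir);
    Files.write(inputDir.resolve("input.txt"),
        "some input line\n".getBytes(StandardCharsets.UTF_8));
  }

  @Test
  public void jobProducesExpectedOutput() throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapred.job.tracker", "local");  // local job runner
    conf.set("fs.default.name", "file:///");  // local filesystem

    int exitCode = ToolRunner.run(conf, new Bar(),
        new String[] {inputDir.toString(), outputDir.toString()});
    assertEquals(0, exitCode);

    // Diff actual vs. expected output (one reducer => one part file).
    List<String> actual = Files.readAllLines(
        outputDir.resolve("part-00000"), StandardCharsets.UTF_8);
    List<String> expected = Files.readAllLines(
        Paths.get("src/test/resources/expected-output.txt"),
        StandardCharsets.UTF_8);
    assertEquals(expected, actual);
  }

  private static void deleteRecursively(File f) {
    File[] children = f.listFiles();
    if (children != null) {
      for (File child : children) {
        deleteRecursively(child);
      }
    }
    f.delete();
  }
}
```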
Note that the local job runner has a couple of limitations: only one reducer is supported, and the DistributedCache doesn't work (a fix is in progress).
Pseudo-distributed Testing – Running MR Jobs on a Single Machine Using Daemons
The local job runner lets you run your job in a single thread. Running an MR job in a single thread is useful for debugging, but it doesn't properly simulate a real cluster with several Hadoop daemons running (e.g., NameNode, DataNode, TaskTracker, JobTracker, SecondaryNameNode). A pseudo-distributed cluster is composed of a single machine running all Hadoop daemons. This cluster is still relatively easy to manage (though harder than the local job runner) and tests integration with Hadoop better than the local job runner does.
To start using a pseudo-distributed cluster to test your MR jobs in Hadoop, follow the aforementioned advice for using the local job runner, but in your precondition setup include the configuration and start-up of all Hadoop daemons. Then, to start your job, just use bin/hadoop as you would normally.
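Concretely, once the daemons are up, the only difference from the local job runner sketch above is where the configuration points. As a rough illustration (the host/port values below are the usual quickstart defaults of that era and are an assumption here; use whatever your own core-site.xml/mapred-site.xml declare, and copy the input into HDFS first):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

public class RunBarOnPseudoCluster {

  public static void main(String[] args) throws Exception {
    // Assumes a pseudo-distributed cluster (all daemons) is already
    // running on this machine; adjust the addresses to your own config.
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://localhost:9000");
    conf.set("mapred.job.tracker", "localhost:9001");

    // Same hypothetical Bar Tool as in the earlier sketch, but the input
    // and output paths now refer to HDFS rather than the local filesystem.
    int exitCode = ToolRunner.run(conf, new Bar(),
        new String[] {"/user/test/input", "/user/test/output"});
    System.exit(exitCode);
  }
}
```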
Full Integration Testing – Running MR Jobs on a QA Cluster
Probably the most thorough yet most cumbersome mechanism for testing your MR jobs is to run them on a QA cluster composed of at least a few machines. By running your MR jobs on a QA cluster, you'll be testing all aspects of both your job and its integration with Hadoop.
Running your jobs on a QA cluster has many of the same issues as the local job runner. Namely, you'll have to check the output of your job for correctness. You may also want to scan the stderr and stdout produced by each task attempt, which will require collecting these logs to a central place and grepping them. Scribe is a useful tool for collecting logs, though it may be superfluous depending on your QA cluster.
We find that most of our customers have some sort of QA or development cluster where they can deploy and test new jobs, try out newer versions of Hadoop, and practice upgrading clusters from one version of Hadoop to another. If Hadoop is a major part of your production pipeline, then creating a QA or development cluster makes a lot of sense, and repeatedly running jobs on it will ensure that changes to your jobs continue to get tested thoroughly. EC2 may be a good host for your QA cluster, as you can bring it up and down on demand. Take a look at our beta EC2 EBS Hadoop scripts if you're interested in creating a QA cluster in EC2.
You should choose QA practices based on the importance of QA for your organization and also on the amount of resources you have. Simply using a traditional unit-testing framework, MRUnit, and the local job runner can test your MR jobs thoroughly in a simple way without using too many resources. However, running your jobs on a QA or development cluster is naturally the most complete way to test your MR jobs, at the cost of the expense and operational work of a Hadoop cluster.