How do WordCount MapReduce jobs run on a Hadoop YARN cluster with Apache Tez?

2018-06-09 03:18:30

As the Tez GitHub page says, Tez is very simple and at its heart has just two components:

The data-processing pipeline engine, and

A master for the data-processing application, where-by one can put together arbitrary data-processing 'tasks' described above into a task-DAG

My first question: how are existing MapReduce jobs, such as the wordcount example in tez-examples.jar, converted to a task-DAG, and where does that happen? Or are they not converted at all?

My second, and more important, question is about this part:

Every 'task' in tez has the following:

Input to consume key/value pairs from.

Processor to process them.

Output to collect the processed key/value pairs.

Who is in charge of splitting the input data between the Tez tasks? Is it the code the user provides, YARN (the resource manager), or Tez itself?

The same question applies to the output phase. Thanks in advance.

To answer your first question on converting MapReduce jobs to Tez DAGs:

Any MapReduce job can be thought of as a single DAG with 2 vertices (stages). The first vertex is the Map phase, and it is connected to a downstream Reduce vertex via a Shuffle edge.
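To make the two-vertex picture concrete, here is a minimal sketch in plain Java. It does not use the Tez API; the class and method names (TwoVertexWordCount, mapTask, shuffle, reduceTask) are made up for illustration, and the hash-based routing in the shuffle step is an assumption standing in for whatever partitioner a real job would plug in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a 2-vertex DAG: Map vertex -> Shuffle edge -> Reduce vertex.
// NOT the Tez API; every name in this class is hypothetical.
class TwoVertexWordCount {

    // Map vertex: one task per input split, emitting a (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> mapTask(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Shuffle edge: routes every pair to one reduce partition and groups values by key.
    // Hash partitioning is an assumption; in a real job the partitioner is pluggable.
    static List<Map<String, List<Integer>>> shuffle(
            List<List<Map.Entry<String, Integer>>> mapOutputs, int numReducers) {
        List<Map<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new TreeMap<>());
        }
        for (List<Map.Entry<String, Integer>> taskOutput : mapOutputs) {
            for (Map.Entry<String, Integer> kv : taskOutput) {
                int p = Math.floorMod(kv.getKey().hashCode(), numReducers);
                partitions.get(p)
                        .computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                        .add(kv.getValue());
            }
        }
        return partitions;
    }

    // Reduce vertex: one task per partition, summing the values for each key.
    static Map<String, Integer> reduceTask(Map<String, List<Integer>> partition) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : partition.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    // Runs the whole "DAG": all map tasks, then the shuffle, then all reduce tasks.
    static Map<String, Integer> run(List<String> splits, int numReducers) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String split : splits) {
            mapOutputs.add(mapTask(split));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map<String, List<Integer>> partition : shuffle(mapOutputs, numReducers)) {
            result.putAll(reduceTask(partition));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("to be or", "not to be"), 2));
    }
}
```

Each key lands in exactly one partition, so merging the per-reducer results never overwrites a count. A real DAG runs these tasks in separate containers and moves the shuffled data over the network, which this in-memory sketch leaves out.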

There are 2 ways in which MR jobs can be run on Tez:

One approach is to write a native 2-stage DAG using the Tez APIs directly. This is what is currently present in tez-examples.

The second is to keep using the MapReduce APIs and run the job in yarn-tez mode (that is, with mapreduce.framework.name set to yarn-tez). In this scenario, a layer intercepts the MR job submission and, instead of running it on MR, translates the job into a 2-stage Tez DAG and executes that DAG on the Tez runtime.

For your data-handling questions:

The user provides the logic for interpreting the data to be read and for splitting it. Tez then takes the resulting splits and is responsible for assigning a split, or a set of splits, to a given task.

The Tez framework then controls the generation and movement of data, i.e., where the data between intermediate steps is generated and how it is moved between 2 vertices/stages. However, it does not control the underlying data contents/structure, partitioning, or serialization logic, which are provided by user plugins.
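The division of labor for input splits described above can be sketched in plain Java as follows. This is an illustration only, not Tez code: SplitAssignmentSketch, computeSplits, and assignToTasks are hypothetical names, the fixed-size byte-range splitting stands in for the user-provided logic, and the round-robin assignment is an assumed placeholder for the framework's real policy.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of "user defines the splits, framework assigns them to tasks".
// NOT the Tez API; every name in this class is hypothetical.
class SplitAssignmentSketch {

    // A split is modeled as a byte range [start, start + length) of the input.
    static final class Split {
        final long start;
        final long length;

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    // User-side logic: decides how the input is cut into splits.
    // Fixed-size byte ranges are an assumption made for this sketch.
    static List<Split> computeSplits(long inputLength, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long offset = 0; offset < inputLength; offset += splitSize) {
            splits.add(new Split(offset, Math.min(splitSize, inputLength - offset)));
        }
        return splits;
    }

    // Framework-side logic: hands one or more splits to each task.
    // Round-robin is a placeholder policy chosen only to keep the sketch short.
    static List<List<Split>> assignToTasks(List<Split> splits, int numTasks) {
        List<List<Split>> tasks = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            tasks.add(new ArrayList<>());
        }
        for (int i = 0; i < splits.size(); i++) {
            tasks.get(i % numTasks).add(splits.get(i));
        }
        return tasks;
    }

    public static void main(String[] args) {
        List<Split> splits = computeSplits(250, 100);
        List<List<Split>> tasks = assignToTasks(splits, 2);
        for (int t = 0; t < tasks.size(); t++) {
            System.out.println("task " + t + " gets " + tasks.get(t).size() + " split(s)");
        }
    }
}
```

The point of the separation is that computeSplits needs knowledge of the data format, while assignToTasks only needs the list of splits and the number of tasks available.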

The above is only a high-level view; there are additional intricacies. You will get more detailed answers by posting specific questions to the development list ( http://tez.apache.org/mail-lists.html ).
