Hadoop – MapReduce
MapReduce is a framework with which we can write applications that process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.
What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing, based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The Reduce task takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map task.
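For instance, a hypothetical word-count job illustrates this flow: the map step emits each word as a (word, 1) tuple, and the reduce step sums the counts for each key.

   Input to map:     "apple orange apple"
   Output of map:    (apple, 1), (orange, 1), (apple, 1)
   Output of reduce: (apple, 2), (orange, 1)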
The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model.
Generally, the MapReduce paradigm is based on sending the computation to where the data resides!
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
- Map stage: The map or mapper’s job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
- Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer’s job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks, which reduces network traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.
Inputs and Outputs (Java Perspective)
The MapReduce framework operates on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
The key and the value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. The input and output types of a MapReduce job are: (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output).
Example Scenario
Consider data recording the electrical consumption of an organization, year by year. If such data is given as input, we have to write applications to process it and produce results such as finding the year of maximum usage, the year of minimum usage, and so on. This is a walkover for programmers with a finite number of records: they will simply write the logic to produce the required output, and pass the data to the application they have written.
But think of the data representing the electrical consumption of all the large-scale industries of a particular state, since its formation.
When we write applications to process such bulk data,
- They will take a lot of time to execute.
- There will be heavy network traffic when we move the data from its source to the network server, and so on.
To solve these problems, we have the MapReduce framework.
This consumption data is saved as sample.txt and given as input. Assuming each line holds a year followed by twelve monthly readings and the annual average (the figures below are illustrative), the input file looks as shown below.
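   1979 23 23 2 43 24 25 26 26 26 26 25 26 25
   1980 26 27 28 28 28 30 31 31 31 30 30 30 29
   1981 31 32 32 32 33 34 35 36 36 34 34 34 34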
Given below is a program to process the sample data using the MapReduce framework.
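The listing below is a minimal sketch rather than a definitive implementation: it assumes the classic org.apache.hadoop.mapred API shipped with hadoop-core-1.2.1, the mapper emits every reading keyed by its year, and the reducer keeps the per-year maximum. Helper class names such as EUnitMapper and EUnitReducer are illustrative.

   package hadoop;

   import java.io.IOException;
   import java.util.Iterator;
   import java.util.StringTokenizer;

   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.IntWritable;
   import org.apache.hadoop.io.LongWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapred.*;

   public class ProcessUnits {

      // Mapper: input key = line offset, input value = one line of sample.txt.
      // Emits (year, reading) for every numeric reading on the line.
      public static class EUnitMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
         public void map(LongWritable key, Text value,
               OutputCollector<Text, IntWritable> output, Reporter reporter)
               throws IOException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            String year = tokens.nextToken();        // first field is the year
            while (tokens.hasMoreTokens()) {
               int units = Integer.parseInt(tokens.nextToken());
               output.collect(new Text(year), new IntWritable(units));
            }
         }
      }

      // Reducer: keeps the maximum reading seen for each year.
      public static class EUnitReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
         public void reduce(Text key, Iterator<IntWritable> values,
               OutputCollector<Text, IntWritable> output, Reporter reporter)
               throws IOException {
            int max = Integer.MIN_VALUE;
            while (values.hasNext()) {
               max = Math.max(max, values.next().get());
            }
            output.collect(key, new IntWritable(max));
         }
      }

      public static void main(String[] args) throws Exception {
         JobConf conf = new JobConf(ProcessUnits.class);
         conf.setJobName("eleunit_max");
         conf.setOutputKeyClass(Text.class);
         conf.setOutputValueClass(IntWritable.class);
         conf.setMapperClass(EUnitMapper.class);
         conf.setCombinerClass(EUnitReducer.class); // max is safe to pre-aggregate
         conf.setReducerClass(EUnitReducer.class);
         conf.setInputFormat(TextInputFormat.class);
         conf.setOutputFormat(TextOutputFormat.class);
         FileInputFormat.setInputPaths(conf, new Path(args[0]));
         FileOutputFormat.setOutputPath(conf, new Path(args[1]));
         JobClient.runJob(conf);
      }
   }

Because taking a maximum is associative and commutative, the same reducer class can safely double as the combiner.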
Save the above program as ProcessUnits.java. The compilation and execution of the program are explained below.
Compilation and Execution of Process Units Program
Let us assume we are in the home directory of a Hadoop user (e.g. /home/hadoop).
Follow the steps given below to compile and execute the above program.
The following command creates a directory to store the compiled Java classes.
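   $ mkdir units   # the directory name "units" is illustrative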
Download hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program. Visit the following link http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core/1.2.1 to download the jar. Let us assume the jar is downloaded to /home/hadoop/.
The following commands are used for compiling the ProcessUnits.java program and creating a jar for the program.
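   $ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java
   $ jar -cvf units.jar -C units/ .   # the jar name "units.jar" is illustrative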
The following command is used to create an input directory in HDFS.
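   $ $HADOOP_HOME/bin/hadoop fs -mkdir input_dir   # "input_dir" is an illustrative HDFS path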
The following command is used to copy the input file named sample.txt into the input directory of HDFS.
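   $ $HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir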
The following command is used to verify the files in the input directory.
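   $ $HADOOP_HOME/bin/hadoop fs -ls input_dir/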
The following command is used to run the Eleunit_max application by taking the input files from the input directory.
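   $ $HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir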
Wait a while until the job finishes. After execution, the console output will contain the number of input splits, the number of Map tasks, the number of reducer tasks, and other job counters.
The following command is used to verify the resultant files in the output folder.
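   $ $HADOOP_HOME/bin/hadoop fs -ls output_dir/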
The following command is used to see the output in the part-00000 file. This file is generated by the MapReduce job and stored in HDFS.
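   $ $HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000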
For the hypothetical sample.txt shown earlier, the MapReduce program would generate output like the following (year and maximum units, tab-separated).
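   1979    43
   1980    31
   1981    36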
The following command is used to copy the output folder from HDFS to the local file system for analysis.
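   $ $HADOOP_HOME/bin/hadoop fs -get output_dir /home/hadoop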
Hadoop – Command Reference
All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop command. Running the Hadoop script without any arguments prints the description for all commands.
Usage: hadoop [--config confdir] COMMAND
The following table lists the available options and their descriptions.