How to Run a Hadoop MapReduce Program on Ubuntu 16.04?

Linoxide 2016-12-20 Channel: Linux

In this blog, I will show you how to run a MapReduce program. MapReduce is one of the core parts of Apache Hadoop: it is Hadoop's processing layer. Before I show you how to run a MapReduce program, let me briefly explain MapReduce.

MapReduce is a system for parallel processing of large data sets. It reduces the input data to a set of results, producing a summary of the data. A MapReduce program has two parts, a mapper and a reducer. The reducers start their reduce work only after all the mappers have finished.

Mapper : It maps input key/value pairs to a set of intermediate key/value pairs.

Reducer : It reduces a set of intermediate values which share a key to a smaller set of values.

In the wordcount MapReduce program, we provide one or more input files, which can be any text files. When the program runs, it goes through the following phases:

Splitting : It splits each line in the input file into words.

Mapping : It forms a key/value pair for each word, where the word is the key and 1 is the value.

Shuffling : Key/value pairs that share the same key are grouped together.

Reducing : The values of each key are added together, giving the count for that word.
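
To make the four phases concrete, here is a minimal sketch in plain Java. It does not use any Hadoop classes; the class name WordCountPhases and its methods are my own illustration, not the program that is exported later. It only imitates, on a single machine, what the phases do to the two-line input file used in this blog.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free walk-through of the wordcount phases.
class WordCountPhases {

    // Splitting + mapping: break each line into words and emit (word, 1) per word.
    public static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffling: collect the values of identical keys; TreeMap keeps the keys sorted.
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reducing: add up the values of each key.
    public static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = 0;
            for (int one : group.getValue()) {
                sum += one;
            }
            counts.put(group.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "This is my first mapreduce test",
                "This is wordcount program");
        // Print one "word<TAB>count" line per distinct word.
        for (Map.Entry<String, Integer> e : reduce(shuffle(map(lines))).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

On a real cluster the mapping and reducing run as separate tasks, possibly on different machines, and the framework does the shuffling between them. The sketch keeps everything in one process only so that each phase can be read as one method.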

Running MapReduce Program

MapReduce programs are typically written in Java, and many developers use the Eclipse IDE to write them. So in this blog, I will show you how to export a MapReduce program as a jar file from Eclipse and run it on a Hadoop cluster.

My MapReduce program is already set up as a project in Eclipse.

Now, to run this MapReduce program on a Hadoop cluster, we will export the project as a jar file. Open the File menu in Eclipse and click Export. Under the Java option, select JAR file and click Next.

Select the Wordcount project and give the path and name for the jar file; I am naming it wordcount.jar. Click Next twice.

Now click Browse, select the main class, and finally click Finish to create the jar file. If you get a warning at this point, just click OK.

Check whether your Hadoop cluster is up and running.

Command: jps

hadoop@hadoop-VirtualBox:~$ jps

3008 NodeManager

3924 Jps

2885 ResourceManager

2505 DataNode

3082 JobHistoryServer

2716 SecondaryNameNode

2383 NameNode

hadoop@hadoop-VirtualBox:~$
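
If you want to automate this check, the jps listing can be compared against the daemons you expect. The following is only a sketch: the class JpsCheck is hypothetical, and its list of expected daemons is an assumption taken from the listing above for a single-node setup (JobHistoryServer is treated as optional).

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: reports which expected Hadoop daemons are absent from jps output.
class JpsCheck {

    // Assumption: the daemons of a single-node cluster, as seen in the listing above.
    static final List<String> EXPECTED = Arrays.asList(
            "NameNode", "DataNode", "SecondaryNameNode", "ResourceManager", "NodeManager");

    public static Set<String> missingDaemons(String jpsOutput) {
        Set<String> missing = new LinkedHashSet<>(EXPECTED);
        for (String line : jpsOutput.split("\\R")) {
            // Each jps line has the form "<pid> <name>"; compare the whole name,
            // so that SecondaryNameNode is not mistaken for NameNode.
            String[] parts = line.trim().split("\\s+");
            if (parts.length >= 2) {
                missing.remove(parts[1]);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Sample text copied from the listing above; real jps output could be passed in instead.
        String sample = "3008 NodeManager\n3924 Jps\n2885 ResourceManager\n2505 DataNode\n"
                + "3082 JobHistoryServer\n2716 SecondaryNameNode\n2383 NameNode\n";
        System.out.println("Missing daemons: " + missingDaemons(sample));
    }
}
```

An empty set means every expected daemon showed up in the listing.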

We have put the input file for the wordcount program onto HDFS.

hadoop@hadoop-VirtualBox:~$ hdfs dfs -put input /

hadoop@hadoop-VirtualBox:~$ hdfs dfs -cat /input

This is my first mapreduce test

This is wordcount program

hadoop@hadoop-VirtualBox:~$

Now run the wordcount.jar file using the command below.

Note: Since we selected the main class while exporting wordcount.jar, there is no need to mention the main class in the command.
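
The note works because the export wizard records the selected class as the Main-Class attribute in the jar's manifest. If a jar has no such attribute, hadoop jar expects the class name as the first argument after the jar file. Here is a small sketch, using only the standard java.util.jar API, for checking what a jar declares. The class ManifestMainClass, the demo jar it writes, and the class name WordCount are hypothetical and only for illustration.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

// Hypothetical helper for inspecting the Main-Class entry of a jar manifest.
class ManifestMainClass {

    // Returns the Main-Class value, or null if the jar has no manifest or no such entry.
    public static String mainClassOf(File jar) {
        try (JarFile jarFile = new JarFile(jar)) {
            Manifest manifest = jarFile.getManifest();
            return manifest == null
                    ? null
                    : manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes a temporary jar that contains only a manifest, for demonstration.
    public static File writeDemoJar(String mainClass) {
        Manifest manifest = new Manifest();
        manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        if (mainClass != null) {
            manifest.getMainAttributes().put(Attributes.Name.MAIN_CLASS, mainClass);
        }
        try {
            File jar = File.createTempFile("manifest-demo", ".jar");
            jar.deleteOnExit();
            try (JarOutputStream out = new JarOutputStream(new FileOutputStream(jar), manifest)) {
                // No class files are added; the constructor has already written the manifest.
            }
            return jar;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass a jar path as the first argument, or fall back to the demo jar.
        File jar = args.length > 0 ? new File(args[0]) : writeDemoJar("WordCount");
        String mainClass = mainClassOf(jar);
        System.out.println(mainClass == null
                ? "No Main-Class entry: name the class after the jar in the hadoop jar command"
                : "Main-Class: " + mainClass);
    }
}
```

Running it with the path of your exported wordcount.jar as the argument should print the class you picked in the export wizard.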

Command: hadoop jar wordcount.jar /input /output

hadoop@hadoop-VirtualBox:~$ hadoop jar wordcount.jar /input /output

16/11/27 22:52:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032

16/11/27 22:52:22 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.

16/11/27 22:52:27 INFO input.FileInputFormat: Total input paths to process : 1

16/11/27 22:52:28 INFO mapreduce.JobSubmitter: number of splits:1

16/11/27 22:52:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1480267251741_0001

16/11/27 22:52:32 INFO impl.YarnClientImpl: Submitted application application_1480267251741_0001

16/11/27 22:52:33 INFO mapreduce.Job: The url to track the job: http://hadoop-VirtualBox:8088/proxy/application_1480267251741_0001/

16/11/27 22:52:33 INFO mapreduce.Job: Running job: job_1480267251741_0001

16/11/27 22:53:20 INFO mapreduce.Job: Job job_1480267251741_0001 running in uber mode : false

16/11/27 22:53:20 INFO mapreduce.Job:  map 0% reduce 0%

16/11/27 22:53:45 INFO mapreduce.Job:  map 100% reduce 0%

16/11/27 22:54:13 INFO mapreduce.Job:  map 100% reduce 100%

16/11/27 22:54:15 INFO mapreduce.Job: Job job_1480267251741_0001 completed successfully

16/11/27 22:54:16 INFO mapreduce.Job: Counters: 49

          File System Counters

                    FILE: Number of bytes read=124

                    FILE: Number of bytes written=237911

                    FILE: Number of read operations=0

                    FILE: Number of large read operations=0

                    FILE: Number of write operations=0

                    HDFS: Number of bytes read=150

                    HDFS: Number of bytes written=66

                    HDFS: Number of read operations=6

                    HDFS: Number of large read operations=0

                    HDFS: Number of write operations=2

          Job Counters

                    Launched map tasks=1

                    Launched reduce tasks=1

                    Data-local map tasks=1

                    Total time spent by all maps in occupied slots (ms)=21062

                    Total time spent by all reduces in occupied slots (ms)=25271

                    Total time spent by all map tasks (ms)=21062

                    Total time spent by all reduce tasks (ms)=25271

                    Total vcore-milliseconds taken by all map tasks=21062

                    Total vcore-milliseconds taken by all reduce tasks=25271

                    Total megabyte-milliseconds taken by all map tasks=21567488

                    Total megabyte-milliseconds taken by all reduce tasks=25877504

          Map-Reduce Framework

                    Map input records=2

                    Map output records=10

                    Map output bytes=98

                    Map output materialized bytes=124

                    Input split bytes=92

                    Combine input records=0

                    Combine output records=0

                    Reduce input groups=8

                    Reduce shuffle bytes=124

                    Reduce input records=10

                    Reduce output records=8

                    Spilled Records=20

                    Shuffled Maps =1

                    Failed Shuffles=0

                    Merged Map outputs=1

                    GC time elapsed (ms)=564

                    CPU time spent (ms)=4300

                    Physical memory (bytes) snapshot=330784768

                    Virtual memory (bytes) snapshot=3804205056

                    Total committed heap usage (bytes)=211812352

          Shuffle Errors

                    BAD_ID=0

                    CONNECTION=0

                    IO_ERROR=0

                    WRONG_LENGTH=0

                    WRONG_MAP=0

                    WRONG_REDUCE=0

          File Input Format Counters

                    Bytes Read=58

          File Output Format Counters

                    Bytes Written=66

hadoop@hadoop-VirtualBox:~$
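
Several of the counters above can be cross-checked by hand against the two-line input file. The sketch below recomputes five of them in plain Java. The class CounterCheck is hypothetical, and it rests on two assumptions: that each input line ends with a single newline byte, and that every output line is written as the word, a tab, the count and a newline.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical cross-check of a few job counters against the small input file.
class CounterCheck {

    // Returns {map input records, map output records, reduce input groups,
    //          input bytes, output bytes} for the given input lines.
    public static int[] expectedCounters(List<String> lines) {
        int inputBytes = 0;
        int mapOutputRecords = 0;
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // Assumption: one newline byte terminates every input line.
            inputBytes += line.getBytes(StandardCharsets.UTF_8).length + 1;
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                mapOutputRecords++;                 // one (word, 1) pair per word
                counts.merge(word, 1, Integer::sum);
            }
        }
        int outputBytes = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Assumption: output lines look like "word<TAB>count<NEWLINE>".
            String outLine = e.getKey() + "\t" + e.getValue() + "\n";
            outputBytes += outLine.getBytes(StandardCharsets.UTF_8).length;
        }
        return new int[] {lines.size(), mapOutputRecords, counts.size(), inputBytes, outputBytes};
    }

    public static void main(String[] args) {
        int[] c = expectedCounters(Arrays.asList(
                "This is my first mapreduce test",
                "This is wordcount program"));
        System.out.println("Map input records=" + c[0]);
        System.out.println("Map output records=" + c[1]);
        System.out.println("Reduce input groups=" + c[2]);
        System.out.println("Bytes Read=" + c[3]);
        System.out.println("Bytes Written=" + c[4]);
    }
}
```

For the input used here, the five values come out as 2, 10, 8, 58 and 66, which agree with the Map input records, Map output records, Reduce input groups, Bytes Read and Bytes Written counters in the job output.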

After the program runs successfully, go to HDFS and check the part file inside the output directory.

Below is the output of the wordcount program.

hadoop@hadoop-VirtualBox:~$ hdfs dfs -cat /output/part-r-00000

This        2
first       1
is          2
mapreduce   1
my          1
program     1
test        1
wordcount   1
hadoop@hadoop-VirtualBox:~$

Conclusion

This example is in Java, but you can write a MapReduce program in Python as well. We successfully ran a Hadoop MapReduce program on a Hadoop cluster on Ubuntu 16.04. The steps to run a MapReduce program on other Linux environments remain the same. Before running the program, make sure that your Hadoop cluster is up and running and that your input file is present in HDFS.