How to Run a Hadoop MapReduce Program on Ubuntu 16.04?
In this blog, I will show you how to run a MapReduce program. MapReduce is one of the core parts of Apache Hadoop: it is Hadoop's processing layer. So before I show you how to run a MapReduce program, let me briefly explain MapReduce.
MapReduce is a system for parallel processing of large data sets. It reduces the data into results and creates a summary of the data. A MapReduce program has two parts: a mapper and a reducer. The reducers do their reduce work only after the mappers have finished.
Mapper : It maps input key/value pairs to a set of intermediate key/value pairs.
Reducer : It reduces a set of intermediate values which share a key to a smaller set of values.
In the wordcount MapReduce program, we provide one or more text files as input. When the program starts, it goes through the following phases:
Splitting : It splits each line in the input file into words.
Mapping : It forms key/value pairs, where each word is a key and 1 is the value assigned to it.
Shuffling : Pairs that share the same key get grouped together.
Reducing : The values of each identical key are added together.
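The four phases above can be sketched in plain Java without any Hadoop classes. This is only an illustration of the idea, not the program that is exported later in this blog: the class name WordCountPhases and its methods are made up for this sketch, and a TreeMap stands in for the sorting that happens during the shuffle.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java walk-through of the wordcount phases (illustration only, no Hadoop involved).
class WordCountPhases {

    // Splitting + Mapping: break each line into words and emit one (word, 1) pair per word.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffling: collect the values of identical keys; the TreeMap keeps the keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reducing: add up the values collected for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The same two lines that are used as the input file later in this blog.
        List<String> lines = Arrays.asList(
                "This is my first mapreduce test",
                "This is wordcount program");
        reduce(shuffle(map(lines)))
                .forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

On a real cluster, the map and reduce steps run as separate tasks, possibly on different machines, and the framework does the grouping between them.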
Running MapReduce Program
A MapReduce program is usually written in Java, and developers mostly use the Eclipse IDE to write it. So in this blog, I will show you how to export a MapReduce program into a jar file from Eclipse IDE and run it on a Hadoop cluster.
My MapReduce program is already set up as a project in my Eclipse IDE.
Now, to run this MapReduce program on a Hadoop cluster, we will export the project as a jar file. Select the File menu in Eclipse and click on Export. Under the Java option, select JAR file and click on Next.
Select the Wordcount project and give the path and name for the jar file (I am naming it wordcount.jar), then click on Next twice.
Now click on Browse, select the main class, and finally click on Finish to create the jar file. If you get a warning at this point, just click OK.
Check whether your Hadoop cluster is up and running.
Command: jps
hadoop@hadoop-VirtualBox:~$ jps
3008 NodeManager
3924 Jps
2885 ResourceManager
2505 DataNode
3082 JobHistoryServer
2716 SecondaryNameNode
2383 NameNode
hadoop@hadoop-VirtualBox:~$
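If you would rather not read the jps listing by eye, the check can be sketched in a few lines of Java. The class name DaemonCheck and the list of required daemons are assumptions made for this sketch: it treats NameNode, DataNode, ResourceManager and NodeManager as the daemons needed to run a job on a setup like this one, and expects lines in the "pid name" form shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Checks a jps listing for the daemons a job needs (illustration only).
class DaemonCheck {

    // Assumption: these four daemons are enough for HDFS plus YARN on a single-node setup.
    static final List<String> REQUIRED =
            Arrays.asList("NameNode", "DataNode", "ResourceManager", "NodeManager");

    // Returns the required daemons that do not appear in the given "pid name" lines.
    static List<String> missing(List<String> jpsLines) {
        Set<String> running = new HashSet<>();
        for (String line : jpsLines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length >= 2) {
                running.add(parts[1]);
            }
        }
        List<String> missing = new ArrayList<>();
        for (String daemon : REQUIRED) {
            if (!running.contains(daemon)) {
                missing.add(daemon);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The jps listing from this blog, pasted in as plain strings.
        List<String> jpsLines = Arrays.asList(
                "3008 NodeManager", "3924 Jps", "2885 ResourceManager", "2505 DataNode",
                "3082 JobHistoryServer", "2716 SecondaryNameNode", "2383 NameNode");
        List<String> missing = missing(jpsLines);
        System.out.println(missing.isEmpty()
                ? "All required daemons are running"
                : "Missing daemons: " + missing);
    }
}
```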
Next, put the input file for the wordcount program onto HDFS and check its contents.
hadoop@hadoop-VirtualBox:~$ hdfs dfs -put input /
hadoop@hadoop-VirtualBox:~$ hdfs dfs -cat /input
This is my first mapreduce test
This is wordcount program
hadoop@hadoop-VirtualBox:~$
Now run the wordcount.jar file using the command below.
Note: Since we selected the main class while exporting wordcount.jar, there is no need to mention the main class in the command.
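The reason is that the export step records the selected class as the Main-Class attribute in the jar's manifest, and hadoop jar uses that attribute when it is present. If you want to confirm what was recorded, a small sketch like the following reads it back with the standard java.util.jar classes. The class name ManifestMainClass is made up for this sketch, and the default path wordcount.jar assumes the jar sits in the current directory.

```java
import java.io.File;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Prints the Main-Class recorded in a jar's manifest (illustration only).
class ManifestMainClass {

    // Returns the Main-Class attribute, or null if the jar has no manifest or no such attribute.
    static String mainClassOf(File jar) throws IOException {
        try (JarFile jarFile = new JarFile(jar)) {
            Manifest manifest = jarFile.getManifest();
            if (manifest == null) {
                return null;
            }
            return manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the jar is in the current directory unless a path is given.
        File jar = new File(args.length > 0 ? args[0] : "wordcount.jar");
        if (!jar.isFile()) {
            System.out.println("Jar not found: " + jar);
            return;
        }
        String mainClass = mainClassOf(jar);
        System.out.println(mainClass != null
                ? "Main-Class: " + mainClass
                : "No Main-Class in the manifest");
    }
}
```

If the manifest has no Main-Class, the class name has to be given on the command line right after the jar file, before /input and /output.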
Command: hadoop jar wordcount.jar /input /output
hadoop@hadoop-VirtualBox:~$ hadoop jar wordcount.jar /input /output
16/11/27 22:52:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/11/27 22:52:22 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/11/27 22:52:27 INFO input.FileInputFormat: Total input paths to process : 1
16/11/27 22:52:28 INFO mapreduce.JobSubmitter: number of splits:1
16/11/27 22:52:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1480267251741_0001
16/11/27 22:52:32 INFO impl.YarnClientImpl: Submitted application application_1480267251741_0001
16/11/27 22:52:33 INFO mapreduce.Job: The url to track the job: http://hadoop-VirtualBox:8088/proxy/application_1480267251741_0001/
16/11/27 22:52:33 INFO mapreduce.Job: Running job: job_1480267251741_0001
16/11/27 22:53:20 INFO mapreduce.Job: Job job_1480267251741_0001 running in uber mode : false
16/11/27 22:53:20 INFO mapreduce.Job: map 0% reduce 0%
16/11/27 22:53:45 INFO mapreduce.Job: map 100% reduce 0%
16/11/27 22:54:13 INFO mapreduce.Job: map 100% reduce 100%
16/11/27 22:54:15 INFO mapreduce.Job: Job job_1480267251741_0001 completed successfully
16/11/27 22:54:16 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=124
FILE: Number of bytes written=237911
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=150
HDFS: Number of bytes written=66
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=21062
Total time spent by all reduces in occupied slots (ms)=25271
Total time spent by all map tasks (ms)=21062
Total time spent by all reduce tasks (ms)=25271
Total vcore-milliseconds taken by all map tasks=21062
Total vcore-milliseconds taken by all reduce tasks=25271
Total megabyte-milliseconds taken by all map tasks=21567488
Total megabyte-milliseconds taken by all reduce tasks=25877504
Map-Reduce Framework
Map input records=2
Map output records=10
Map output bytes=98
Map output materialized bytes=124
Input split bytes=92
Combine input records=0
Combine output records=0
Reduce input groups=8
Reduce shuffle bytes=124
Reduce input records=10
Reduce output records=8
Spilled Records=20
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=564
CPU time spent (ms)=4300
Physical memory (bytes) snapshot=330784768
Virtual memory (bytes) snapshot=3804205056
Total committed heap usage (bytes)=211812352
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=58
File Output Format Counters
Bytes Written=66
hadoop@hadoop-VirtualBox:~$
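Several of the counters above can be recomputed by hand from the two-line input file, which is a handy sanity check. The sketch below does this in plain Java. The class name CounterCheck is made up, and the byte sizes rest on assumptions: each map output key is counted as one length byte plus the word's bytes, each value as a 4-byte integer, each output line as word, tab, count and newline, and the input file is assumed to end with a newline.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Recomputes a few job counters from the sample input (illustration only, no Hadoop involved).
class CounterCheck {

    // The input file from this blog; a newline after each line is assumed.
    static final String INPUT =
            "This is my first mapreduce test\nThis is wordcount program\n";

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static Map<String, Long> expectedCounters(String input) {
        long inputRecords = 0;
        long mapOutputRecords = 0;
        long mapOutputBytes = 0;
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : input.split("\n")) {
            inputRecords++; // one map input record per line
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                mapOutputRecords++; // one (word, 1) pair per word
                // Assumed sizes: 1 length byte + the word's bytes for the key, 4 bytes for the value.
                mapOutputBytes += 1 + utf8Length(word) + 4;
                counts.merge(word, 1, Integer::sum);
            }
        }
        long bytesWritten = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            // Assumed output line layout: word, tab, count, newline.
            bytesWritten += utf8Length(entry.getKey()) + 1
                    + String.valueOf(entry.getValue()).length() + 1;
        }
        Map<String, Long> counters = new LinkedHashMap<>();
        counters.put("Bytes Read", (long) utf8Length(input));
        counters.put("Map input records", inputRecords);
        counters.put("Map output records", mapOutputRecords);
        counters.put("Map output bytes", mapOutputBytes);
        counters.put("Reduce input groups", (long) counts.size());
        counters.put("Reduce output records", (long) counts.size());
        counters.put("Bytes Written", bytesWritten);
        return counters;
    }

    public static void main(String[] args) {
        expectedCounters(INPUT)
                .forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

Under these assumptions the sketch arrives at the same values as the job log: 58 bytes read, 2 map input records, 10 map output records, 98 map output bytes, 8 reduce input groups, 8 reduce output records and 66 bytes written.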
After the program runs successfully, go to HDFS and check the part file inside the output directory.
Below is the output of the wordcount program.
hadoop@hadoop-VirtualBox:~$ hdfs dfs -cat /output/part-r-00000
This 2
first 1
is 2
mapreduce 1
my 1
program 1
test 1
wordcount 1
hadoop@hadoop-VirtualBox:~$
Conclusion
This example is in Java, but you can write a MapReduce program in Python as well. We successfully ran a Hadoop MapReduce program on a Hadoop cluster on Ubuntu 16.04. The steps to run a MapReduce program on other Linux environments remain the same. Before running the program, make sure that your Hadoop cluster is up and running and that your input file is present in HDFS.