How to Set Up a Single Node Hadoop Cluster Using Docker?


In this article, I will show you how to set up a single node Hadoop cluster using Docker. Before I start with the setup, let me briefly remind you what Docker and Hadoop are.

Docker is a software containerization platform that lets you package your application together with all of its libraries, dependencies, and environment settings into a container, called a Docker container. With Docker, you can build, ship, and run an application on the fly.

For example, if you want to test an application on an Ubuntu system, you do not need to install a complete operating system on your laptop/desktop or start a virtual machine with Ubuntu, which would take a lot of time and disk space. You can simply start an Ubuntu Docker container that already has the environment and libraries you need and test your application on the fly.
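For instance, a disposable Ubuntu environment is a single command away (a minimal sketch; Docker pulls the ubuntu image from Docker Hub if it is not already present locally):

hadoop@hadoop-VirtualBox:~$ docker run -it ubuntu bash

This drops you into a shell inside a fresh Ubuntu container; type exit to leave it and stop the container.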


Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers. These days it is one of the most important technologies in the industry. To use Hadoop to store and analyse huge amounts of data, you need to set up a Hadoop cluster. If you have set up a Hadoop cluster before, you know it's not an easy task.

What if I told you that setting up a Hadoop cluster is hardly a 5-10 minute job, would you believe me? I guess not!

This is where Docker comes into the picture: using Docker, you can set up a Hadoop cluster in no time.
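In fact, as you will see in the walkthrough below, the whole setup boils down to two commands: pulling a ready-made Hadoop image and running it.

hadoop@hadoop-VirtualBox:~$ sudo docker pull sequenceiq/hadoop-docker:2.7.1
hadoop@hadoop-VirtualBox:~$ docker run -it sequenceiq/hadoop-docker:2.7.1 /etc/bootstrap.sh -bash

Both commands are explained in detail in the following sections.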

Benefits of using Docker for setting up a Hadoop cluster

  • Installs and runs Hadoop in no time.
  • Uses resources only as needed, so nothing is wasted.
  • Easily scalable, which makes it well suited for Hadoop test environments.
  • No need to worry about Hadoop dependencies, libraries, etc.; Docker takes care of them.
Set Up a Single Node Hadoop Cluster Using Docker

So let us now see how to set up a single node Hadoop cluster using Docker. I am using an Ubuntu 16.04 system, and Docker is already installed and configured on it.

Before I set up the single node Hadoop cluster using Docker, let me run a simple example to confirm that Docker is working correctly on my system.

Let me check whether I have any containers running as of now.

hadoop@hadoop-VirtualBox:~$ docker ps

CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

I don't have any containers running as of now. Let me run the simple hello-world Docker example.

hadoop@hadoop-VirtualBox:~$ docker run hello-world

Hello from Docker!

This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
  1. The Docker client contacted the Docker daemon.
  2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
  3. The Docker daemon created a new container from that image which runs the
     executable that produces the output you are currently reading.
  4. The Docker daemon streamed that output to the Docker client, which sent it
     to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker Hub account:
 https://hub.docker.com

For more examples and ideas, visit:
 https://docs.docker.com/engine/userguide/

So now we know that Docker is working properly. Let us go ahead and install Hadoop in a Docker container. To do so, we need a Hadoop Docker image. The command below pulls a hadoop-2.7.1 Docker image.

hadoop@hadoop-VirtualBox:~$ sudo docker pull sequenceiq/hadoop-docker:2.7.1

[sudo] password for hadoop:

2.7.1: Pulling from sequenceiq/hadoop-docker

b253335dcf03: Pull complete

a3ed95caeb02: Pull complete

11c8cd810974: Pull complete

49d8575280f2: Pull complete

2240837237fc: Pull complete

e727168a1e18: Pull complete

ede4c89e7b84: Pull complete

a14c58904e3e: Pull complete

8d72113f79e9: Pull complete

44bc7aa001db: Pull complete

f1af80e588d1: Pull complete

54a0f749c9e0: Pull complete

f620e24d35d5: Pull complete

ff68d052eb73: Pull complete

d2f5cd8249bc: Pull complete

5d3c1e2c16b1: Pull complete

6e1d5d78f75c: Pull complete

a0d5160b2efd: Pull complete

b5c5006d9017: Pull complete

6a8c6da42d5b: Pull complete

13d1ee497861: Pull complete

e3be4bdd7a5c: Pull complete

391fb9240903: Pull complete

Digest: sha256:0ae1419989844ca8b655dea261b92554740ec3c133e0826866c49319af7359db

Status: Downloaded newer image for sequenceiq/hadoop-docker:2.7.1

Run the command below to check whether the Hadoop Docker image was downloaded correctly.

hadoop@hadoop-VirtualBox:~$ docker images

REPOSITORY                 TAG                 IMAGE ID            CREATED             SIZE

hello-world                latest              c54a2cc56cbb        5 months ago        1.848 kB

sequenceiq/hadoop-docker   2.7.1               e3c6e05ab051        2 years ago         1.516 GB

hadoop@hadoop-VirtualBox:~$

Now run this Docker image, which will create a Docker container in which hadoop-2.7.1 will run.

hadoop@hadoop-VirtualBox:~$ docker run -it sequenceiq/hadoop-docker:2.7.1 /etc/bootstrap.sh -bash

/

Starting sshd:                                             [  OK  ]

Starting namenodes on [e34a63e1dcf8]

e34a63e1dcf8: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-e34a63e1dcf8.out

localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-e34a63e1dcf8.out

Starting secondary namenodes [0.0.0.0]

0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-e34a63e1dcf8.out

starting yarn daemons

starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-e34a63e1dcf8.out

localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-e34a63e1dcf8.out

Now that the Docker container has started, run the jps command to see whether the Hadoop services are up and running.

bash-4.1# jps

291 SecondaryNameNode

560 NodeManager

856 Jps

107 NameNode

483 ResourceManager

180 DataNode

bash-4.1#

Open a new terminal and run the command below to see the list of running containers and their details.

hadoop@hadoop-VirtualBox:~$ docker ps

CONTAINER ID        IMAGE                            COMMAND                  CREATED             STATUS              PORTS                                                                                                                   NAMES

e34a63e1dcf8        sequenceiq/hadoop-docker:2.7.1   "/etc/bootstrap.sh -b"   44 minutes ago      Up 44 minutes       22/tcp, 8030-8033/tcp, 8040/tcp, 8042/tcp, 8088/tcp, 49707/tcp, 50010/tcp, 50020/tcp, 50070/tcp, 50075/tcp, 50090/tcp   condescending_poincare

Go back to your Docker container terminal and run the command below to get the IP address of the Docker container.

bash-4.1# ifconfig

eth0      Link encap:Ethernet  HWaddr 02:42:AC:11:00:02

inet addr:172.17.0.2  Bcast:0.0.0.0  Mask:255.255.0.0

inet6 addr: fe80::42:acff:fe11:2/64 Scope:Link

UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

RX packets:56 errors:0 dropped:0 overruns:0 frame:0

TX packets:31 errors:0 dropped:0 overruns:0 carrier:0

collisions:0 txqueuelen:0

RX bytes:6803 (6.6 KiB)  TX bytes:2298 (2.2 KiB)

 

lo        Link encap:Local Loopback

inet addr:127.0.0.1  Mask:255.0.0.0

inet6 addr: ::1/128 Scope:Host

UP LOOPBACK RUNNING  MTU:65536  Metric:1

RX packets:28648 errors:0 dropped:0 overruns:0 frame:0

TX packets:28648 errors:0 dropped:0 overruns:0 carrier:0

collisions:0 txqueuelen:1

RX bytes:4079499 (3.8 MiB)  TX bytes:4079499 (3.8 MiB)

bash-4.1#
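If you prefer not to run ifconfig inside the container, the same IP address should also be retrievable from the host with docker inspect (a sketch; e34a63e1dcf8 is the container ID from my docker ps output above, yours will differ):

hadoop@hadoop-VirtualBox:~$ docker inspect --format '{{ .NetworkSettings.IPAddress }}' e34a63e1dcf8

For the container above, this should print 172.17.0.2, matching the eth0 address shown by ifconfig.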

After running the jps command, we already saw that all the services are running; let us now check the NameNode UI in the browser. Go to 172.17.0.2:50070 in the browser, and there you go: the NameNode UI of a Hadoop cluster running in a Docker container.
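Note that the container IP is only reachable directly from the Docker host itself. If that does not work in your setup (for example when Docker runs inside a separate VM), one option is to publish the NameNode UI port to the host when starting the container, roughly like this (a sketch, not the exact command used above):

hadoop@hadoop-VirtualBox:~$ docker run -it -p 50070:50070 sequenceiq/hadoop-docker:2.7.1 /etc/bootstrap.sh -bash

You could then open http://localhost:50070 on the host instead of using the container IP.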

Just to make sure that the Hadoop cluster is working fine, let us run a Hadoop MapReduce example in the Docker container.

bash-4.1# cd $HADOOP_PREFIX

bash-4.1# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output 'dfs[a-z.]+'

16/11/29 13:07:02 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032

16/11/29 13:07:07 WARN mapreduce.JobSubmitter: No job jar file set.  User classes may not be found. See Job or Job#setJar(String).

16/11/29 13:07:08 INFO input.FileInputFormat: Total input paths to process : 27

16/11/29 13:07:10 INFO mapreduce.JobSubmitter: number of splits:27

16/11/29 13:07:12 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1480434980067_0001

16/11/29 13:07:14 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.

16/11/29 13:07:15 INFO impl.YarnClientImpl: Submitted application application_1480434980067_0001

16/11/29 13:07:16 INFO mapreduce.Job: The url to track the job: http://e34a63e1dcf8:8088/proxy/application_1480434980067_0001/

16/11/29 13:07:16 INFO mapreduce.Job: Running job: job_1480434980067_0001

16/11/29 13:07:58 INFO mapreduce.Job: Job job_1480434980067_0001 running in uber mode : false

16/11/29 13:07:58 INFO mapreduce.Job:  map 0% reduce 0%

16/11/29 13:10:44 INFO mapreduce.Job:  map 22% reduce 0%

16/11/29 13:13:40 INFO mapreduce.Job:  map 22% reduce 7%

16/11/29 13:13:41 INFO mapreduce.Job:  map 26% reduce 7%

16/11/29 13:20:30 INFO mapreduce.Job:  map 96% reduce 32%

16/11/29 13:21:01 INFO mapreduce.Job:  map 100% reduce 32%

16/11/29 13:21:04 INFO mapreduce.Job:  map 100% reduce 100%

16/11/29 13:21:08 INFO mapreduce.Job: Job job_1480434980067_0001 completed successfully

16/11/29 13:21:10 INFO mapreduce.Job: Counters: 50

File System Counters

FILE: Number of bytes read=345

FILE: Number of bytes written=2621664

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=64780

HDFS: Number of bytes written=437

HDFS: Number of read operations=84

HDFS: Number of large read operations=0

HDFS: Number of write operations=2

Launched map tasks=29

Launched reduce tasks=1

Data-local map tasks=29

Map-Reduce Framework

Map input records=1586

Map output records=24

Bytes Written=437

16/11/29 13:21:10 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032

16/11/29 13:21:10 WARN mapreduce.JobSubmitter: No job jar file set.  User classes may not be found. See Job or Job#setJar(String).

16/11/29 13:21:10 INFO input.FileInputFormat: Total input paths to process : 1

16/11/29 13:21:12 INFO mapreduce.JobSubmitter: number of splits:1

16/11/29 13:21:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1480434980067_0002

16/11/29 13:21:13 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.

16/11/29 13:21:14 INFO impl.YarnClientImpl: Submitted application application_1480434980067_0002

16/11/29 13:21:14 INFO mapreduce.Job: The url to track the job: http://e34a63e1dcf8:8088/proxy/application_1480434980067_0002/

16/11/29 13:21:14 INFO mapreduce.Job: Running job: job_1480434980067_0002

16/11/29 13:21:48 INFO mapreduce.Job: Job job_1480434980067_0002 running in uber mode : false

16/11/29 13:21:48 INFO mapreduce.Job:  map 0% reduce 0%

16/11/29 13:22:12 INFO mapreduce.Job:  map 100% reduce 0%

16/11/29 13:22:37 INFO mapreduce.Job:  map 100% reduce 100%

16/11/29 13:22:38 INFO mapreduce.Job: Job job_1480434980067_0002 completed successfully

16/11/29 13:22:38 INFO mapreduce.Job: Counters: 49

Job Counters

Launched map tasks=1

Launched reduce tasks=1

Data-local map tasks=1

Map-Reduce Framework

Map input records=11

Map output records=11

Map output bytes=263

Map output materialized bytes=291

Input split bytes=132

Physical memory (bytes) snapshot=334082048

Virtual memory (bytes) snapshot=1297162240

Total committed heap usage (bytes)=209518592

File Input Format Counters

Bytes Read=437

File Output Format Counters

Bytes Written=197

bash-4.1#

Check the output.

bash-4.1# bin/hdfs dfs -cat output/*

6             dfs.audit.logger

4             dfs.class

3             dfs.server.namenode.

2             dfs.period

2             dfs.audit.log.maxfilesize

2             dfs.audit.log.maxbackupindex

1             dfsmetrics.log

1             dfsadmin

1             dfs.servers

1             dfs.replication

1             dfs.file

bash-4.1#
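If you want to keep the result outside HDFS, you could also copy the output directory to the container's local filesystem with the standard hdfs dfs -get command (a sketch; /tmp/output is just an arbitrary local path):

bash-4.1# bin/hdfs dfs -get output /tmp/output
bash-4.1# cat /tmp/output/*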
Conclusion

We successfully ran a single node Hadoop cluster using Docker. As you saw, we hardly had to do anything to set up the Hadoop cluster ourselves, and in no time we had it up and running. As mentioned before, Docker is best suited for test environments, so if you want to test a Hadoop application, setting up the Hadoop cluster in a Docker container and testing the application there is the easiest and fastest way.
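Once you are done experimenting, tearing everything down is just as quick. A minimal sketch, using the container ID from the docker ps output above (yours will differ):

bash-4.1# exit
hadoop@hadoop-VirtualBox:~$ docker stop e34a63e1dcf8
hadoop@hadoop-VirtualBox:~$ docker rm e34a63e1dcf8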

Ref From: linoxide