Jenny (Xiao) Zhang

6 Things You Need To Know About Hadoop


What is Hadoop?

Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware. Hadoop is an Apache top-level project. It is licensed under the Apache License 2.0.

Why Hadoop?

Low-cost storage is now widely available, so the key challenge in handling Big Data is reading large amounts of data fast. Hadoop is good at reading a big file in one pass. The default data block size (the smallest unit of data that a file system can store) is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x). The block size on a local disk is generally 4 KB.
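To make the block-size difference concrete, here is a small sketch that counts how many blocks a file occupies at each size. The 1 GB file size and the helper name `block_count` are only assumptions for illustration.

```python
MB = 1024 * 1024
KB = 1024

def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of blocks needed to hold a file (the last block may be partial)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes

file_size = 1024 * MB  # hypothetical 1 GB file

hdfs_blocks = block_count(file_size, 64 * MB)  # 64 MB HDFS blocks
disk_blocks = block_count(file_size, 4 * KB)   # 4 KB disk blocks

print(hdfs_blocks)  # 16
print(disk_blocks)  # 262144
```

Fewer, larger blocks mean fewer seeks and much less bookkeeping per file, which is why reading a big file in one pass is the case Hadoop handles well.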

Hadoop today is not for: low-latency data access, lots of small files (each file needs its own metadata), multiple writes, or arbitrary file modification.
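The small-files point can be illustrated with a rough count of metadata entries. This is only a sketch: the figure of 150 bytes per entry is an assumed rule of thumb, not a number from this post, and the function name is hypothetical.

```python
BLOCK_SIZE_MB = 64       # default block size quoted in the text
BYTES_PER_OBJECT = 150   # ASSUMPTION: rough metadata cost per file or block entry

def metadata_objects(file_sizes_mb):
    """Count metadata entries: one per file plus one per block of that file."""
    objects = 0
    for size in file_sizes_mb:
        blocks = max(1, -(-size // BLOCK_SIZE_MB))  # ceiling division
        objects += 1 + blocks
    return objects

one_big_file = [1024]          # a single 1 GB file
many_small_files = [1] * 1024  # the same 1 GB stored as 1,024 files of 1 MB

big = metadata_objects(one_big_file)
small = metadata_objects(many_small_files)

print(big, big * BYTES_PER_OBJECT)      # 17 2550
print(small, small * BYTES_PER_OBJECT)  # 2048 307200
```

The same amount of data costs over a hundred times more metadata when it is stored as small files.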

Hadoop Ecosystem



Flume – Import and Export Unstructured or Semi-Structured Data to/from Hadoop.

Sqoop (SQL+Hadoop) – Import and Export Structured Data to/from Hadoop.

HDFS – Hadoop Distributed File System.

MapReduce – a programming model and associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

HBase – NoSQL database that runs on top of HDFS, can be read from and written to by MapReduce jobs, and can be used for OLTP.

Pig – data analysis tool originally developed by Yahoo; uses a procedural data-flow language (Pig Latin); good for semi-structured data.

Hive – data warehouse tool; uses a SQL-like language (HiveQL); good for structured data.

Mahout – a machine learning framework, used to develop social network and e-commerce recommendations.

Apache Oozie – workflow scheduler and management tool; can schedule and run Hadoop jobs in parallel.
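To give a feel for the MapReduce programming model listed above, here is a minimal word-count sketch in plain Python. It runs in a single process and does not use the Hadoop API; the function names and the sample input are assumptions for illustration only.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)

lines = ["Hadoop stores data", "Hadoop processes data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

On a real cluster the map calls run in parallel on different nodes, and the shuffle moves the grouped pairs across the network to the reducers.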





HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode and a number of DataNodes.

NameNode is the master. It maintains and manages the blocks that are present on the DataNodes, and it manages all the metadata of HDFS. The NameNode keeps this metadata in RAM for fast access. The NameNode is associated with the Job Tracker. There is a Secondary NameNode in the system, but it is not a hot standby of the NameNode. It periodically takes a checkpoint of the NameNode's metadata and writes it to a file system (hard disk), where it can be used for disaster recovery.

DataNode is the slave. It serves read/write requests from clients. The DataNode is associated with the Task Tracker. A file is split into one or more blocks, and these blocks are stored on a set of DataNodes.
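The split of a file into blocks, and the block-to-DataNode map that the NameNode keeps, can be sketched as follows. The round-robin placement and the names `place_blocks`, `blk_N` and `dnN` are assumptions for illustration; real HDFS also replicates each block and uses its own placement policy.

```python
from itertools import cycle

BLOCK_SIZE_MB = 64  # block size quoted in the text

def place_blocks(file_size_mb, datanodes):
    """Split a file into blocks and assign each block to a DataNode.

    Returns a block id -> DataNode mapping, the kind of metadata a
    NameNode tracks. Placement here is simple round-robin (an assumption).
    """
    n_blocks = (file_size_mb + BLOCK_SIZE_MB - 1) // BLOCK_SIZE_MB
    nodes = cycle(datanodes)
    return {f"blk_{i}": next(nodes) for i in range(n_blocks)}

block_map = place_blocks(200, ["dn1", "dn2", "dn3"])
print(block_map)
# {'blk_0': 'dn1', 'blk_1': 'dn2', 'blk_2': 'dn3', 'blk_3': 'dn1'}
```

A 200 MB file needs four blocks, the last of which is only partly filled.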

The beauty of Hadoop is data locality. A traditional distributed file system transfers terabytes of data over the network to the place where it is processed. Hadoop instead moves the computation to the data: only kilobytes of code travel over the network, and the data is processed locally on the DataNodes.




1. User (a person) copies the input file into the Distributed File System.

2. User submits the job to Client (software).

3. Client gets information about the input file.

4. Client splits the job into multiple input splits.

5. Client uploads the job information to DFS.

6. Client submits job to Job Tracker.

7. Job Tracker initializes the job in job queue.

8. Job Tracker reads job files from DFS to understand the job.

9. Job Tracker creates Map tasks and Reduce tasks based on the job type. The number of Map tasks equals the number of input splits, which is configurable, and each Map task runs on one input split. The output of the Map tasks goes to the Reduce tasks. The number of Reduce tasks can also be defined. The Map and Reduce tasks run on the DataNodes.

10. Task Trackers send Heartbeats to Job Tracker to let it know they are available for tasks.

11. Job Tracker picks the Task Trackers that have the most local Data.

12. Job Tracker assigns tasks to Task Trackers.

Once the tasks are completed, the Task Trackers send a Heartbeat to the Job Tracker again, and the Job Tracker assigns more tasks.
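Steps 9 to 12 above can be sketched as a toy scheduler: one Map task per input split, assigned to an available Task Tracker that holds the split locally when possible. This is a simplified illustration, not the Job Tracker's real algorithm; the dictionary layout, slot counts and names are all assumptions.

```python
def create_map_tasks(input_splits):
    """Step 9: create one Map task per input split."""
    return [{"task": f"map_{i}", "split": s} for i, s in enumerate(input_splits)]

def pick_tracker(task, trackers):
    """Steps 10-12: choose among trackers that reported free slots,
    preferring one that stores the task's split locally."""
    available = [t for t in trackers if t["free_slots"] > 0]
    if not available:
        return None  # the task waits for the next heartbeat
    local = [t for t in available if task["split"] in t["local_splits"]]
    chosen = (local or available)[0]
    chosen["free_slots"] -= 1
    return chosen["name"]

trackers = [
    {"name": "tt1", "free_slots": 1, "local_splits": {"split_0"}},
    {"name": "tt2", "free_slots": 2, "local_splits": {"split_1", "split_2"}},
]
tasks = create_map_tasks(["split_0", "split_1", "split_2", "split_3"])
assignments = {t["task"]: pick_tracker(t, trackers) for t in tasks}

print(assignments)
# {'map_0': 'tt1', 'map_1': 'tt2', 'map_2': 'tt2', 'map_3': None}
```

The last task stays unassigned until a Task Tracker finishes its work and reports free capacity in a later heartbeat.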

Hadoop 1.0 vs 2.0


Hadoop 1.0 has the following challenges:

  • Limited horizontal scalability of the NameNode (it becomes a bottleneck beyond about 4,000 nodes)
  • The NameNode is a single point of failure of the system
  • The Job Tracker is overburdened
  • Cannot run non-MapReduce applications on HDFS
  • No multi-tenancy (the ability to run multiple types of jobs on the same resources at the same time)

How does Hadoop 2.0 solve the challenges of Hadoop 1.0?


HDFS Federation – different NameNodes for different organizations. Each namespace has one NameNode. The NameNodes are independent and do not talk to each other. Data is spread across a large number of DataNodes, and all DataNodes are used as common storage for all NameNodes. Each DataNode registers with all NameNodes. There is one block pool per NameNode / namespace, but one DataNode can store blocks for multiple namespaces.
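The relationship between namespaces, block pools and DataNodes in a federated cluster can be sketched with two small classes. The class layout, the `BP-<namespace>` naming and the namespace names are hypothetical and only serve to show the structure described above.

```python
class NameNode:
    """One NameNode per namespace, each with its own block pool."""
    def __init__(self, namespace):
        self.namespace = namespace
        self.block_pool_id = f"BP-{namespace}"  # hypothetical id format
        self.datanodes = []

class DataNode:
    """Common storage: holds a separate block pool for every NameNode."""
    def __init__(self, name):
        self.name = name
        self.block_pools = {}  # block pool id -> set of block ids

    def register(self, namenode):
        self.block_pools.setdefault(namenode.block_pool_id, set())
        namenode.datanodes.append(self)

namenodes = [NameNode("sales"), NameNode("research")]
datanodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]

# Each DataNode registers with all NameNodes.
for dn in datanodes:
    for nn in namenodes:
        dn.register(nn)

print(sorted(datanodes[0].block_pools))       # ['BP-research', 'BP-sales']
print([len(nn.datanodes) for nn in namenodes])  # [3, 3]
```

The NameNodes never reference each other, which matches the point that they are independent.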


NameNode High Availability – each namespace has one active NameNode, one standby NameNode, and one Secondary NameNode (optional).

Failover process – If the active NameNode does not respond within 10 seconds, the system assumes it is dead. All DataNodes then talk to the standby NameNode, which becomes the active NameNode. The problem is that the NameNode may not really be dead; the network may just be slow.

Fencing – makes sure the failed NameNode is actually dead. Kill all active processes on that NameNode and then kill the NameNode itself. With STONITH ("shoot the other node in the head"), a special signal is sent to the power supply to cut the node's power. The dead NameNode then has to be brought back manually.
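The failover and fencing logic described above can be sketched as follows. This is a toy model under stated assumptions, not Hadoop's failover controller: the node dictionaries, the `failover` function and the `power_off` hook are hypothetical, and only the 10-second timeout comes from the text.

```python
TIMEOUT_SECONDS = 10  # timeout quoted in the text

def failover(active, standby, now, fence):
    """Return the name of the NameNode that should be active at time `now`.

    The standby is promoted only after the old active node has been fenced,
    so that two NameNodes can never be active at the same time.
    """
    if now - active["last_heartbeat"] <= TIMEOUT_SECONDS:
        return active["name"]  # still responding, nothing to do
    if not fence(active):
        raise RuntimeError("fencing failed; refusing to promote the standby")
    active["state"] = "fenced"
    standby["state"] = "active"
    return standby["name"]

def power_off(node):
    """Hypothetical STONITH hook; a real one would talk to a power controller."""
    return True

active = {"name": "nn1", "state": "active", "last_heartbeat": 100.0}
standby = {"name": "nn2", "state": "standby", "last_heartbeat": 100.0}

first = failover(active, standby, now=105.0, fence=power_off)
second = failover(active, standby, now=115.0, fence=power_off)
print(first, second)  # nn1 nn2
```

Fencing before promotion is the design choice that handles the slow-network case: even if nn1 was only slow, it is guaranteed to be off before nn2 takes over.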

YARN – Yet Another Resource Negotiator. It is not tied to MapReduce; it gives better processing control and supports non-MapReduce processing. The Resource Manager replaces the Job Tracker (it contains the scheduler and the applications manager, which manages jobs). The Node Manager replaces the Task Tracker.

Multi-tenancy – YARN supports multi-tenancy, which means you can run multiple types of jobs (batch, interactive, streaming) on the same resources at the same time. There are multiple job queues. Each queue has a priority and is given a certain percentage of the cluster. Jobs within each queue run FIFO.
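The queue model can be sketched in a few lines: each queue receives its percentage of the cluster, and jobs inside a queue start in arrival order. This is an illustration only, not YARN's scheduler; it ignores queue priority, assumes every job needs exactly one slot, and all names and numbers are hypothetical.

```python
def allocate_slots(total_slots, queues):
    """Give each queue its share of the cluster and start its jobs FIFO.

    queues maps a queue name to {"share": percent, "jobs": [names in arrival order]}.
    ASSUMPTION: every job occupies exactly one slot.
    """
    running = {}
    for name, queue in queues.items():
        slots = total_slots * queue["share"] // 100
        running[name] = queue["jobs"][:slots]  # first in, first out
    return running

queues = {
    "batch":       {"share": 60, "jobs": [f"b{i}" for i in range(1, 9)]},
    "interactive": {"share": 30, "jobs": ["i1", "i2"]},
    "streaming":   {"share": 10, "jobs": ["s1", "s2"]},
}
running = allocate_slots(10, queues)

print(running["batch"])        # ['b1', 'b2', 'b3', 'b4', 'b5', 'b6']
print(running["interactive"])  # ['i1', 'i2']
print(running["streaming"])    # ['s1']
```

All three job types run at the same time on the same cluster, each within its own share.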

Thank you very much for reading this blog. Please feel free to contact me if you have any questions or want to learn more about Hadoop.

Note: all the graphs in this blog come from the internet.