Lecture 10: Exploring the World of Hadoop

Objectives of this lecture

  1. Discovering Hadoop and why it’s so important
  2. Exploring the Hadoop Distributed File System
  3. Digging into Hadoop MapReduce
  4. Putting Hadoop to work

Explaining Hadoop

Hadoop was originally built by a Yahoo! engineer named Doug Cutting and is now an open source project managed by the Apache Software Foundation.

  • Search engine innovators like Yahoo! and Google needed to find a way to make sense of the massive amounts of data that their engines were collecting.
  • Hadoop was developed because it represented the most pragmatic way to allow companies to manage huge volumes of data easily.
  • Hadoop allowed big problems to be broken down into smaller elements so that analysis could be done quickly and cost-effectively.
  • By breaking a big data problem into small pieces that can be processed in parallel, and then regrouping the partial results, huge amounts of information can be analyzed quickly (see the sketch below).
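A minimal illustration of this divide / process-in-parallel / regroup pattern, using only local shell tools (big.txt is a hypothetical input file; split -n requires GNU coreutils):

  # divide: split the input into 4 pieces
  split -n l/4 big.txt chunk_
  # process: count words in each piece, up to 4 pieces at a time in parallel
  ls chunk_* | xargs -P 4 -n 1 wc -w > partial_counts.txt
  # regroup: sum the partial counts into a single result
  awk '{ total += $1 } END { print total }' partial_counts.txt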

Now try these commands on work-0:

  • hadoop
  • echo $HADOOP_HOME
  • echo $HADOOP_CLASSPATH
  • echo $HADOOP_CONF_DIR

Two primary components of Hadoop

  • Hadoop Distributed File System (HDFS): A reliable, high-bandwidth, low-cost, data storage cluster that facilitates the management of related files across machines.
  • MapReduce engine: A high-performance parallel/distributed data-processing implementation of the MapReduce algorithm.

HDFS

  • HDFS is a highly reliable, fault-tolerant distributed storage system.
  • It is a data service that offers a unique set of capabilities needed when data volumes and velocity are high.
  • The service includes a “NameNode” and multiple “data nodes”.

NameNodes

  • HDFS works by breaking large files into smaller pieces called blocks. The blocks are stored on data nodes, and it is the responsibility of the NameNode to know which blocks on which data nodes make up the complete file.
  • Data nodes are not very smart, but the NameNode is. The data nodes constantly send heartbeats to the NameNode, asking whether there is anything for them to do. This continuous behavior also tells the NameNode which data nodes are out there and how busy they are.
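To see what the NameNode currently knows about its data nodes, you can ask it for a report (this may require administrative permissions on some clusters):

  # summarise the cluster as the NameNode sees it:
  # live/dead data nodes, capacity, and how full each node is
  hdfs dfsadmin -report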

DataNodes

  • Within the HDFS cluster, data blocks are replicated across multiple data nodes and access is managed by the NameNode.

Under the covers of HDFS

  • Big data brings the big challenges of volume, velocity, and variety.
  • HDFS addresses these challenges by breaking files into a related collection of smaller blocks. These blocks are distributed among the data nodes in the HDFS cluster and are managed by the NameNode.
  • Block sizes are configurable and are usually 128 megabytes (MB) or 256MB, meaning that a 1GB file consumes eight 128MB blocks for its basic storage needs.
  • HDFS is resilient, so these blocks are replicated throughout the cluster in case of a server failure. How does HDFS keep track of all these pieces? The short answer is file system metadata.
  • HDFS metadata is stored in the NameNode, and while the cluster is operating, all the metadata is loaded into the physical memory of the NameNode server.
  • Data nodes are very simple. They are servers that contain the blocks for a given set of files.
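To see this block layout for a real file, you can ask the NameNode to report it; the path below is only an example:

  # list the blocks that make up the file, their sizes,
  # and the data nodes holding each replica
  hdfs fsck /user/student/yourID/license.txt -files -blocks -locations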

Data storage of HDFS

  • Multiple copies of each block are stored across the cluster on different nodes.
  • This is known as data replication. By default, the HDFS replication factor is 3.
  • It provides fault tolerance, reliability, and high availability.
  • A large file is split into a number of smaller blocks. These blocks are stored on different nodes in the cluster in a distributed manner, and each block is replicated and stored on several different nodes.
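The replication factor of a file can be inspected and changed from the command line; the path below is again only an example:

  # show replication factor, block size, file size in bytes, and file name
  hdfs dfs -stat "%r %o %b %n" /user/student/yourID/license.txt
  # raise the replication factor to 4 and wait until the extra copies exist
  hadoop fs -setrep -w 4 /user/student/yourID/license.txt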

HDFS commands

  • hadoop fs
  • hadoop fs -help
  • hadoop fs -ls /
  • hadoop fs -mkdir yourID
  • hadoop fs -ls /user/student
  • hadoop fs -mv LICENSE license.txt
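A typical round trip between the local file system and HDFS, assuming the yourID directory created above and a local file named LICENSE:

  # copy a local file into your HDFS directory
  hadoop fs -put LICENSE /user/student/yourID/license.txt
  # confirm it arrived, then read the first few lines
  hadoop fs -ls /user/student/yourID
  hadoop fs -cat /user/student/yourID/license.txt | head
  # copy it back out of HDFS, then delete the HDFS copy
  hadoop fs -get /user/student/yourID/license.txt ./license_copy.txt
  hadoop fs -rm /user/student/yourID/license.txt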

Hadoop MapReduce

  • Google released a paper on MapReduce technology in December 2004. This became the genesis of the Hadoop processing model.
  • MapReduce is a programming model that allows us to perform parallel and distributed processing on huge data sets.

Traditional way

  • Critical path problem: the total time taken to finish the job without delaying the next milestone or the actual completion date. If any one machine delays its part of the job, the whole job is delayed.
  • Reliability problem: what if a machine working on part of the data fails? Managing this failover becomes a challenge.
  • Equal split issue: how should the data be divided into smaller chunks so that each machine gets an even share of the work? In other words, how do we split the data so that no individual machine is overloaded or underutilized?
  • Single split may fail: if any machine fails to produce its output, the overall result cannot be computed, so there must be a mechanism that provides fault tolerance.
  • Aggregation of results: there must be a mechanism to aggregate the results generated by each machine into the final output.

MapReduce

MapReduce gives you the flexibility to write code logic without having to worry about the design issues of the underlying distributed system.

MapReduce is a programming framework that lets us perform distributed, parallel processing on large data sets.

  • MapReduce consists of two distinct tasks: Map and Reduce.
  • As the name MapReduce suggests, the reducer phase takes place after the mapper phase has completed.
  • In the map job, a block of data is read and processed to produce key-value pairs as intermediate outputs.
  • The output of a Mapper, or map job (key-value pairs), is the input to the Reducer.
  • The reducer receives key-value pairs from multiple map jobs.
  • The reducer then aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output.
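The whole map / shuffle / reduce pipeline can be imitated locally with ordinary Unix tools before running anything on the cluster (assuming a local copy of license.txt):

  # map    -> tr   : emit one word (the key) per line
  # shuffle-> sort : bring identical keys together
  # reduce -> uniq : count how many times each key occurs
  cat license.txt | tr -s '[:space:]' '\n' | sort | uniq -c | sort -rn | head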

MapReduce example - word count

hadoop jar /opt/apps/ecm/service/hadoop/2.8.5-1.2.0/package/hadoop-2.8.5-1.2.0/share/hadoop/tools/lib/hadoop-streaming-2.8.5.jar \
    -input /user/yanfeikang/license.txt \
    -output /user/yanfeikang/output \
    -mapper "/usr/bin/cat" \
    -reducer "/usr/bin/wc"

MapReduce example - word count

hadoop jar /opt/apps/ecm/service/hadoop/2.8.5-1.2.0/package/hadoop-2.8.5-1.2.0/share/hadoop/tools/lib/hadoop-streaming-2.8.5.jar \
    -input /user/yanfeikang/license.txt \
    -output /user/yanfeikang/output \
    -mapper "/usr/bin/cat" \
    -reducer "/usr/bin/wc" \
    -numReduceTasks 1
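The cat/wc pair above only reports overall line, word, and character counts. For a per-word count, the same streaming jar can be run with a different mapper and reducer; a sketch (the output directory must not already exist, and the exact formatting of the reducer input can vary with streaming settings):

hadoop jar /opt/apps/ecm/service/hadoop/2.8.5-1.2.0/package/hadoop-2.8.5-1.2.0/share/hadoop/tools/lib/hadoop-streaming-2.8.5.jar \
    -input /user/yanfeikang/license.txt \
    -output /user/yanfeikang/wordcount_output \
    -mapper "tr -s '[:space:]' '\n'" \
    -reducer "uniq -c" \
    -numReduceTasks 1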