Big Data Essentials¶

Exploring the World of Hadoop¶
Yanfei Kang
yanfeikang@buaa.edu.cn
School of Economics and Management
Beihang University
http://yanfei.site

Objectives of this lecture¶

  1. Introduction to distributed computing
  2. Discovering Hadoop and why it’s so important
  3. Exploring the Hadoop Distributed File System
  4. Digging into Hadoop MapReduce
  5. Putting Hadoop to work

Why Distributed Systems?¶

What is Hadoop?¶

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Explaining Hadoop¶

Hadoop was originally built by a Yahoo! engineer named Doug Cutting and is now an open-source project managed by the Apache Software Foundation.

  • Search engine innovators like Yahoo! and Google needed to find a way to make sense of the massive amounts of data that their engines were collecting.
  • Hadoop was developed because it represented the most pragmatic way to allow companies to manage huge volumes of data easily.
  • Hadoop allowed big problems to be broken down into smaller elements so that analysis could be done quickly and cost-effectively.
  • By breaking the big data problem into small pieces that could be processed in parallel, you can process the information and regroup the small pieces to present results.
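This split-process-regroup idea can be imitated locally with nothing but standard Unix tools (a sketch for intuition only; no Hadoop is involved). `tr` plays the mapper, `sort` plays the shuffle, and `uniq -c` plays the reducer:

```shell
# A local imitation of the MapReduce word-count pipeline:
#   map     -> tr      : split each line into one word per line
#   shuffle -> sort    : bring identical keys (words) together
#   reduce  -> uniq -c : aggregate each group into a count
printf 'big data\nbig problems\n' \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c
# prints:  2 big / 1 data / 1 problems
```

On a real cluster, Hadoop runs many such map and reduce tasks in parallel across machines and handles the shuffle over the network.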

Who uses Hadoop?¶

  • Facebook uses Hadoop, Hive, and HBase for data warehousing and real-time application serving.
  • Twitter uses Hadoop, Pig, and HBase for data analysis, visualization, social graph analysis, and machine learning.
  • Yahoo! uses Hadoop for data analytics, machine learning, search ranking, email antispam, ad optimization...
  • eBay, Samsung, Rackspace, J.P. Morgan, Groupon, LinkedIn, AOL, Last.fm...

Now let us try these commands:¶

  • hadoop
  • echo $HADOOP_HOME
  • echo $HADOOP_CONF_DIR
In [1]:
hadoop
Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class

  OPTIONS is none or any of:

buildpaths                       attempt to add class files from build tree
--config dir                     Hadoop config directory
--debug                          turn on shell script debug mode
--help                           usage information
hostnames list[,of,host,names]   hosts to use in slave mode
hosts filename                   list of hosts to use in slave mode
loglevel level                   set the log4j level for this command
workers                          turn on worker mode

  SUBCOMMAND is one of:


    Admin Commands:

daemonlog     get/set the log level for each daemon

    Client Commands:

archive       create a Hadoop archive
checknative   check native Hadoop and compression libraries availability
classpath     prints the class path needed to get the Hadoop jar and the
              required libraries
conftest      validate configuration XML files
credential    interact with credential providers
distch        distributed metadata changer
distcp        copy file or directories recursively
dtutil        operations related to delegation tokens
envvars       display computed Hadoop environment variables
fs            run a generic filesystem user client
gridmix       submit a mix of synthetic job, modeling a profiled from
              production load
jar <jar>     run a jar file. NOTE: please use "yarn jar" to launch YARN
              applications, not this command.
jnipath       prints the java.library.path
kdiag         Diagnose Kerberos Problems
kerbname      show auth_to_local principal conversion
key           manage keys via the KeyProvider
rumenfolder   scale a rumen input trace
rumentrace    convert logs into a rumen trace
s3guard       manage metadata on S3
trace         view and modify Hadoop tracing settings
version       print the version

    Daemon Commands:

kms           run KMS, the Key Management Server

SUBCOMMAND may print help when invoked w/o parameters or with -h.

In [2]:
hadoop version
Hadoop 3.2.1
Source code repository git@gitlab.alibaba-inc.com:soe/emr-hadoop.git -r 42f2ce4ee2a135d4523a6bbdb3c90e2fe6472b94
Compiled by root on 2024-08-07T11:31Z
Compiled with protoc 2.5.0
From source with checksum 1b543c4574cae11c43e3f6d84c15983d
This command was run using /opt/apps/HADOOP-COMMON/hadoop-3.2.1-1.2.16-alinux3/share/hadoop/common/hadoop-common-3.2.1.jar
In [3]:
echo $HADOOP_HOME
/opt/apps/HADOOP-COMMON/hadoop-common-current/
In [4]:
echo $JAVA_HOME
/usr/lib/jvm/java-1.8.0
In [5]:
echo $HADOOP_CONF_DIR
/etc/taihao-apps/hadoop-conf

Modules of Hadoop¶

  • Hadoop Distributed File System (HDFS): A reliable, high-bandwidth, low-cost data storage cluster that facilitates the management of related files across machines.
  • Hadoop MapReduce: A high-performance parallel/distributed data-processing implementation of the MapReduce algorithm.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop Common: The common utilities that support the other Hadoop modules.
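To make the MapReduce module concrete, here is a hedged sketch of the classic word count written for Hadoop Streaming, which lets any executable act as mapper or reducer. The scripts are plain shell; the pipeline below smoke-tests them locally, with `sort` standing in for the shuffle phase (the file names `mapper.sh`/`reducer.sh` are illustrative):

```shell
# mapper.sh: emit "word<TAB>1" for every word read from stdin
cat > mapper.sh <<'EOF'
#!/bin/sh
tr -s ' ' '\n' | awk 'NF { print $0 "\t1" }'
EOF

# reducer.sh: sum the counts for each word
# (in a real job, input arrives grouped by key after the shuffle)
cat > reducer.sh <<'EOF'
#!/bin/sh
awk -F'\t' '{ sum[$1] += $2 } END { for (w in sum) print w "\t" sum[w] }'
EOF
chmod +x mapper.sh reducer.sh

# Local smoke test: sort emulates the shuffle between map and reduce.
printf 'big data\nbig problems\n' | ./mapper.sh | sort | ./reducer.sh
```

On a real cluster the same scripts would be submitted via the Hadoop Streaming jar (e.g. `hadoop jar .../hadoop-streaming-*.jar -mapper mapper.sh -reducer reducer.sh -input ... -output ...`); the jar's exact path varies by distribution, so check `$HADOOP_HOME/share/hadoop/tools/lib/` on your installation.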