Big Data Essentials

L7: Exploring the World of Hadoop





Yanfei Kang
yanfeikang@buaa.edu.cn
School of Economics and Management
Beihang University
http://yanfei.site

Objectives of this lecture

  1. Discovering Hadoop and why it’s so important
  2. Exploring the Hadoop Distributed File System
  3. Digging into Hadoop MapReduce
  4. Putting Hadoop to work

Explaining Hadoop

Hadoop was originally built by a Yahoo! engineer named Doug Cutting and is now an open source project managed by the Apache Software Foundation.

  • Search engine innovators like Yahoo! and Google needed to find a way to make sense of the massive amounts of data that their engines were collecting.
  • Hadoop was developed because it represented the most pragmatic way to allow companies to manage huge volumes of data easily.
  • Hadoop allowed big problems to be broken down into smaller elements so that analysis could be done quickly and cost-effectively.
  • By breaking a big data problem into small pieces that can be processed in parallel, the work is completed quickly, and the partial results are then regrouped to present the final answer, as the toy shell sketch below illustrates.
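
A toy shell analogy of this divide-process-regroup idea (purely illustrative and not part of Hadoop itself; big.txt stands for any large local text file):
In [ ]:
# Break the problem into smaller pieces (100,000 lines each).
split -l 100000 big.txt piece_
# Process the pieces in parallel: count the words in each piece independently.
for p in piece_*; do
    (wc -w < "$p" > "$p.count") &
done
wait
# Regroup the partial results into the final answer.
awk '{ total += $1 } END { print total }' piece_*.count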

Now let us try these commands:

  • hadoop
  • echo $HADOOP_HOME
  • echo $JAVA_HOME
  • echo $HADOOP_CLASSPATH
  • echo $HADOOP_CONF_DIR
In [13]:
hadoop
Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class

  OPTIONS is none or any of:

buildpaths                       attempt to add class files from build tree
--config dir                     Hadoop config directory
--debug                          turn on shell script debug mode
--help                           usage information
hostnames list[,of,host,names]   hosts to use in slave mode
hosts filename                   list of hosts to use in slave mode
loglevel level                   set the log4j level for this command
workers                          turn on worker mode

  SUBCOMMAND is one of:


    Admin Commands:

daemonlog     get/set the log level for each daemon

    Client Commands:

archive       create a Hadoop archive
checknative   check native Hadoop and compression libraries availability
classpath     prints the class path needed to get the Hadoop jar and the
              required libraries
conftest      validate configuration XML files
credential    interact with credential providers
distch        distributed metadata changer
distcp        copy file or directories recursively
dtutil        operations related to delegation tokens
envvars       display computed Hadoop environment variables
fs            run a generic filesystem user client
gridmix       submit a mix of synthetic job, modeling a profiled from
              production load
jar <jar>     run a jar file. NOTE: please use "yarn jar" to launch YARN
              applications, not this command.
jnipath       prints the java.library.path
kdiag         Diagnose Kerberos Problems
kerbname      show auth_to_local principal conversion
key           manage keys via the KeyProvider
rumenfolder   scale a rumen input trace
rumentrace    convert logs into a rumen trace
s3guard       manage metadata on S3
trace         view and modify Hadoop tracing settings
version       print the version

    Daemon Commands:

kms           run KMS, the Key Management Server

SUBCOMMAND may print help when invoked w/o parameters or with -h.

In [13]:
hadoop version
Hadoop 3.1.3
Source code repository git@gitlab.alibaba-inc.com:soe/emr-hadoop.git -r b024c0ab171face7c2f0c07ca1e15fbc27268333
Compiled by jenkins on 2020-07-16T05:50Z
Compiled with protoc 2.5.0
From source with checksum ea3bea418ed679131ec1b32a9d35d1
This command was run using /opt/apps/ecm/service/hadoop/3.1.3-1.0.2/package/hadoop-3.1.3-1.0.2/share/hadoop/common/hadoop-common-3.1.3.jar
In [2]:
echo $HADOOP_HOME
/usr/lib/hadoop-current
In [14]:
echo $JAVA_HOME
/usr/lib/jvm/java-1.8.0
In [3]:
echo $HADOOP_CLASSPATH
/usr/lib/hadoop-current/lib/*:/usr/lib/tez-current/*:/usr/lib/tez-current/lib/*:/etc/ecm/tez-conf:/usr/lib/hadoop-current/lib/*:/usr/lib/tez-current/*:/usr/lib/tez-current/lib/*:/etc/ecm/tez-conf:/opt/apps/extra-jars/*:/usr/lib/spark-current/yarn/spark-2.4.5-yarn-shuffle.jar:/opt/apps/extra-jars/*:/usr/lib/spark-current/yarn/spark-2.4.5-yarn-shuffle.jar
In [4]:
echo $HADOOP_CONF_DIR
/etc/ecm/hadoop-conf

Two primary components of Hadoop

  • Hadoop Distributed File System (HDFS): A reliable, high-bandwidth, low-cost data storage cluster that facilitates the management of related files across machines.
  • MapReduce engine: A high-performance parallel/distributed data-processing implementation of the MapReduce algorithm.

HDFS

  • HDFS is a highly reliable, fault-tolerant distributed storage system designed to run on commodity hardware.
  • It is a data service that offers a unique set of capabilities needed when data volumes and velocity are high.
  • The service includes a “NameNode” and multiple “data nodes”.

NameNodes

  • HDFS works by breaking large files into smaller pieces called blocks. The blocks are stored on data nodes, and it is the responsibility of the NameNode to know which blocks, on which data nodes, make up the complete file (the sketch after this list shows how to inspect that mapping).
  • Data nodes are not very smart, but the NameNode is. The data nodes constantly ask the NameNode whether there is anything for them to do. This continuous behavior also tells the NameNode which data nodes are out there and how busy they are.
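
A hedged way to inspect this block-to-DataNode mapping on a running cluster (the path is only a placeholder; fsck reports, for each file, its blocks and the data nodes that hold each replica):
In [ ]:
# Report the blocks of a file and the DataNodes storing each replica.
hdfs fsck /path/to/some/file -files -blocks -locations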

DataNodes

  • Within the HDFS cluster, data blocks are replicated across multiple data nodes and access is managed by the NameNode.

Under the covers of HDFS

  • Big data brings the big challenges of volume, velocity, and variety.
  • HDFS addresses these challenges by breaking files into a related collection of smaller blocks. These blocks are distributed among the data nodes in the HDFS cluster and are managed by the NameNode.
  • Block sizes are configurable and are usually 128 megabytes (MB) or 256MB, meaning that a 1GB file consumes eight 128MB blocks for its basic storage needs (a quick check of the configured value follows this list).
  • HDFS is resilient, so these blocks are replicated throughout the cluster in case of a server failure. How does HDFS keep track of all these pieces? The short answer is file system metadata.
  • HDFS metadata is stored in the NameNode, and while the cluster is operating, all the metadata is loaded into the physical memory of the NameNode server.
  • Data nodes are relatively simple: they are servers that contain the blocks for a given set of files.
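
A hedged check of the block size configured on this cluster (dfs.blocksize is the standard HDFS property; the printed value depends on the installation):
In [ ]:
# Print the configured default block size in bytes (134217728 bytes = 128MB).
hdfs getconf -confKey dfs.blocksize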

Data storage of HDFS

  • Multiple copies of each block are stored across the cluster on different nodes.
  • This is data replication; by default, the HDFS replication factor is 3.
  • It provides fault tolerance, reliability, and high availability.
  • A large file is split into a number of smaller blocks, which are stored on different nodes of the cluster in a distributed manner, and each block is replicated across several nodes; the commands below show how to inspect and change the replication factor.
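
A hedged look at replication in practice (the path is only a placeholder; -stat %r prints a file's replication factor and -setrep changes it):
In [ ]:
# Show the replication factor of an existing file (3 by default).
hadoop fs -stat %r /path/to/some/file
# Change the replication factor of that file to 2 (use with care: fewer
# replicas means less fault tolerance).
hadoop fs -setrep 2 /path/to/some/file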

HDFS commands

  • hadoop fs
  • hadoop fs -help
  • hadoop fs -ls /
  • hadoop fs -ls /user/student
  • hadoop fs -mv LICENSE license.txt
  • hadoop fs -mkdir yourNAME
In [6]:
hadoop fs
Usage: hadoop fs [generic options]
	[-appendToFile <localsrc> ... <dst>]
	[-cat [-ignoreCrc] <src> ...]
	[-checksum <src> ...]
	[-chgrp [-R] GROUP PATH...]
	[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
	[-chown [-R] [OWNER][:[GROUP]] PATH...]
	[-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
	[-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
	[-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
	[-createSnapshot <snapshotDir> [<snapshotName>]]
	[-deleteSnapshot <snapshotDir> <snapshotName>]
	[-df [-h] [<path> ...]]
	[-du [-s] [-h] [-v] [-x] <path> ...]
	[-expunge]
	[-find <path> ... <expression> ...]
	[-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-getfacl [-R] <path>]
	[-getfattr [-R] {-n name | -d} [-e en] <path>]
	[-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
	[-head <file>]
	[-help [cmd ...]]
	[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [<path> ...]]
	[-mkdir [-p] <path> ...]
	[-moveFromLocal <localsrc> ... <dst>]
	[-moveToLocal <src> <localdst>]
	[-mv <src> ... <dst>]
	[-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
	[-renameSnapshot <snapshotDir> <oldName> <newName>]
	[-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
	[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
	[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
	[-setfattr {-n name [-v value] | -x name} <path>]
	[-setrep [-R] [-w] <rep> <path> ...]
	[-stat [format] <path> ...]
	[-tail [-f] [-s <sleep interval>] <file>]
	[-test -[defsz] <path>]
	[-text [-ignoreCrc] <src> ...]
	[-touch [-a] [-m] [-t TIMESTAMP ] [-c] <path> ...]
	[-touchz <path> ...]
	[-truncate [-w] <length> <path> ...]
	[-usage [cmd ...]]

Generic options supported are:
-conf <configuration file>        specify an application configuration file
-D <property=value>               define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port>  specify a ResourceManager
-files <file1,...>                specify a comma-separated list of files to be copied to the map reduce cluster
-libjars <jar1,...>               specify a comma-separated list of jar files to be included in the classpath
-archives <archive1,...>          specify a comma-separated list of archives to be unarchived on the compute machines

The general command line syntax is:
command [genericOptions] [commandOptions]


In [7]:
hadoop fs -help
Usage: hadoop fs [generic options]
	[-appendToFile <localsrc> ... <dst>]
	[-cat [-ignoreCrc] <src> ...]
	[-checksum <src> ...]
	[-chgrp [-R] GROUP PATH...]
	[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
	[-chown [-R] [OWNER][:[GROUP]] PATH...]
	[-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
	[-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
	[-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
	[-createSnapshot <snapshotDir> [<snapshotName>]]
	[-deleteSnapshot <snapshotDir> <snapshotName>]
	[-df [-h] [<path> ...]]
	[-du [-s] [-h] [-v] [-x] <path> ...]
	[-expunge]
	[-find <path> ... <expression> ...]
	[-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-getfacl [-R] <path>]
	[-getfattr [-R] {-n name | -d} [-e en] <path>]
	[-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
	[-head <file>]
	[-help [cmd ...]]
	[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [<path> ...]]
	[-mkdir [-p] <path> ...]
	[-moveFromLocal <localsrc> ... <dst>]
	[-moveToLocal <src> <localdst>]
	[-mv <src> ... <dst>]
	[-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
	[-renameSnapshot <snapshotDir> <oldName> <newName>]
	[-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
	[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
	[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
	[-setfattr {-n name [-v value] | -x name} <path>]
	[-setrep [-R] [-w] <rep> <path> ...]
	[-stat [format] <path> ...]
	[-tail [-f] [-s <sleep interval>] <file>]
	[-test -[defsz] <path>]
	[-text [-ignoreCrc] <src> ...]
	[-touch [-a] [-m] [-t TIMESTAMP ] [-c] <path> ...]
	[-touchz <path> ...]
	[-truncate [-w] <length> <path> ...]
	[-usage [cmd ...]]

-appendToFile <localsrc> ... <dst> :
  Appends the contents of all the given local files to the given dst file. The dst
  file will be created if it does not exist. If <localSrc> is -, then the input is
  read from stdin.

-cat [-ignoreCrc] <src> ... :
  Fetch all files that match the file pattern <src> and display their content on
  stdout.

-checksum <src> ... :
  Dump checksum information for files that match the file pattern <src> to stdout.
  Note that this requires a round-trip to a datanode storing each block of the
  file, and thus is not efficient to run on a large number of files. The checksum
  of a file depends on its content, block size and the checksum algorithm and
  parameters used for creating the file.

-chgrp [-R] GROUP PATH... :
  This is equivalent to -chown ... :GROUP ...

-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH... :
  Changes permissions of a file. This works similar to the shell's chmod command
  with a few exceptions.
                                                                                 
  -R           modifies the files recursively. This is the only option currently 
               supported.                                                        
  <MODE>       Mode is the same as mode used for the shell's command. The only   
               letters recognized are 'rwxXt', e.g. +t,a+r,g-w,+rwx,o=r.         
  <OCTALMODE>  Mode specifed in 3 or 4 digits. If 4 digits, the first may be 1 or
               0 to turn the sticky bit on or off, respectively.  Unlike the     
               shell command, it is not possible to specify only part of the     
               mode, e.g. 754 is same as u=rwx,g=rx,o=r.                         
  
  If none of 'augo' is specified, 'a' is assumed and unlike the shell command, no
  umask is applied.

-chown [-R] [OWNER][:[GROUP]] PATH... :
  Changes owner and group of a file. This is similar to the shell's chown command
  with a few exceptions.
                                                                                 
  -R  modifies the files recursively. This is the only option currently          
      supported.                                                                 
  
  If only the owner or group is specified, then only the owner or group is
  modified. The owner and group names may only consist of digits, alphabet, and
  any of [-_./@a-zA-Z0-9]. The names are case sensitive.
  
  WARNING: Avoid using '.' to separate user name and group though Linux allows it.
  If user names have dots in them and you are using local file system, you might
  see surprising results since the shell command 'chown' is used for local files.

-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst> :
  Copy files from the local file system into fs. Copying fails if the file already
  exists, unless the -f flag is given.
  Flags:
                                                                                 
  -p                 Preserves access and modification times, ownership and the  
                     mode.                                                       
  -f                 Overwrites the destination if it already exists.            
  -t <thread count>  Number of threads to be used, default is 1.                 
  -l                 Allow DataNode to lazily persist the file to disk. Forces   
                     replication factor of 1. This flag will result in reduced   
                     durability. Use with care.                                  
  -d                 Skip creation of temporary file(<dst>._COPYING_).           

-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst> :
  Identical to the -get command.

-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ... :
  Count the number of directories, files and bytes under the paths
  that match the specified file pattern.  The output columns are:
  DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
  or, with the -q option:
  QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA
        DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
  The -h option shows file sizes in human readable format.
  The -v option displays a header line.
  The -x option excludes snapshots from being calculated. 
  The -t option displays quota by storage types.
  It should be used with -q or -u option, otherwise it will be ignored.
  If a comma-separated list of storage types is given after the -t option, 
  it displays the quota and usage for the specified types. 
  Otherwise, it displays the quota and usage for all the storage 
  types that support quota. The list of possible storage types(case insensitive):
  ram_disk, ssd, disk and archive.
  It can also pass the value '', 'all' or 'ALL' to specify all the storage types.
  The -u option shows the quota and 
  the usage against the quota without the detailed content summary.The -e option
  shows the erasure coding policy.

-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst> :
  Copy files that match the file pattern <src> to a destination.  When copying
  multiple files, the destination must be a directory. Passing -p preserves status
  [topax] (timestamps, ownership, permission, ACLs, XAttr). If -p is specified
  with no <arg>, then preserves timestamps, ownership, permission. If -pa is
  specified, then preserves permission also because ACL is a super-set of
  permission. Passing -f overwrites the destination if it already exists. raw
  namespace extended attributes are preserved if (1) they are supported (HDFS
  only) and, (2) all of the source and target pathnames are in the /.reserved/raw
  hierarchy. raw namespace xattr preservation is determined solely by the presence
  (or absence) of the /.reserved/raw prefix and not by the -p option. Passing -d
  will skip creation of temporary file(<dst>._COPYING_).

-createSnapshot <snapshotDir> [<snapshotName>] :
  Create a snapshot on a directory

-deleteSnapshot <snapshotDir> <snapshotName> :
  Delete a snapshot from a directory

-df [-h] [<path> ...] :
  Shows the capacity, free and used space of the filesystem. If the filesystem has
  multiple partitions, and no path to a particular partition is specified, then
  the status of the root partitions will be shown.
                                                                                 
  -h  Formats the sizes of files in a human-readable fashion rather than a number
      of bytes.                                                                  

-du [-s] [-h] [-v] [-x] <path> ... :
  Show the amount of space, in bytes, used by the files that match the specified
  file pattern. The following flags are optional:
                                                                                 
  -s  Rather than showing the size of each individual file that matches the      
      pattern, shows the total (summary) size.                                   
  -h  Formats the sizes of files in a human-readable fashion rather than a number
      of bytes.                                                                  
  -v  option displays a header line.                                             
  -x  Excludes snapshots from being counted.                                     
  
  Note that, even without the -s option, this only shows size summaries one level
  deep into a directory.
  
  The output is in the form 
  	size	disk space consumed	name(full path)

-expunge :
  Delete files from the trash that are older than the retention threshold

-find <path> ... <expression> ... :
  Finds all files that match the specified expression and
  applies selected actions to them. If no <path> is specified
  then defaults to the current working directory. If no
  expression is specified then defaults to -print.
  
  The following primary expressions are recognised:
    -name pattern
    -iname pattern
      Evaluates as true if the basename of the file matches the
      pattern using standard file system globbing.
      If -iname is used then the match is case insensitive.
  
    -print
    -print0
      Always evaluates to true. Causes the current pathname to be
      written to standard output followed by a newline. If the -print0
      expression is used then an ASCII NULL character is appended rather
      than a newline.
  
  The following operators are recognised:
    expression -a expression
    expression -and expression
    expression expression
      Logical AND operator for joining two expressions. Returns
      true if both child expressions return true. Implied by the
      juxtaposition of two expressions and so does not need to be
      explicitly specified. The second expression will not be
      applied if the first fails.

-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst> :
  Copy files that match the file pattern <src> to the local name.  <src> is kept. 
  When copying multiple files, the destination must be a directory. Passing -f
  overwrites the destination if it already exists and -p preserves access and
  modification times, ownership and the mode.

-getfacl [-R] <path> :
  Displays the Access Control Lists (ACLs) of files and directories. If a
  directory has a default ACL, then getfacl also displays the default ACL.
                                                                  
  -R      List the ACLs of all files and directories recursively. 
  <path>  File or directory to list.                              

-getfattr [-R] {-n name | -d} [-e en] <path> :
  Displays the extended attribute names and values (if any) for a file or
  directory.
                                                                                 
  -R             Recursively list the attributes for all files and directories.  
  -n name        Dump the named extended attribute value.                        
  -d             Dump all extended attribute values associated with pathname.    
  -e <encoding>  Encode values after retrieving them.Valid encodings are "text", 
                 "hex", and "base64". Values encoded as text strings are enclosed
                 in double quotes ("), and values encoded as hexadecimal and     
                 base64 are prefixed with 0x and 0s, respectively.               
  <path>         The file or directory.                                          

-getmerge [-nl] [-skip-empty-file] <src> <localdst> :
  Get all the files in the directories that match the source file pattern and
  merge and sort them to only one file on local fs. <src> is kept.
                                                                     
  -nl               Add a newline character at the end of each file. 
  -skip-empty-file  Do not add new line character for empty file.    

-head <file> :
  Show the first 1KB of the file.

-help [cmd ...] :
  Displays help for given command or all commands if none is specified.

-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [<path> ...] :
  List the contents that match the specified file pattern. If path is not
  specified, the contents of /user/<currentUser> will be listed. For a directory a
  list of its direct children is returned (unless -d option is specified).
  
  Directory entries are of the form:
  	permissions - userId groupId sizeOfDirectory(in bytes)
  modificationDate(yyyy-MM-dd HH:mm) directoryName
  
  and file entries are of the form:
  	permissions numberOfReplicas userId groupId sizeOfFile(in bytes)
  modificationDate(yyyy-MM-dd HH:mm) fileName
  
    -C  Display the paths of files and directories only.
    -d  Directories are listed as plain files.
    -h  Formats the sizes of files in a human-readable fashion
        rather than a number of bytes.
    -q  Print ? instead of non-printable characters.
    -R  Recursively list the contents of directories.
    -t  Sort files by modification time (most recent first).
    -S  Sort files by size.
    -r  Reverse the order of the sort.
    -u  Use time of last access instead of modification for
        display and sorting.

-mkdir [-p] <path> ... :
  Create a directory in specified location.
                                                  
  -p  Do not fail if the directory already exists 

-moveFromLocal <localsrc> ... <dst> :
  Same as -put, except that the source is deleted after it's copied.

-moveToLocal <src> <localdst> :
  Not implemented yet

-mv <src> ... <dst> :
  Move files that match the specified file pattern <src> to a destination <dst>. 
  When moving multiple files, the destination must be a directory.

-put [-f] [-p] [-l] [-d] <localsrc> ... <dst> :
  Copy files from the local file system into fs. Copying fails if the file already
  exists, unless the -f flag is given.
  Flags:
                                                                       
  -p  Preserves access and modification times, ownership and the mode. 
  -f  Overwrites the destination if it already exists.                 
  -l  Allow DataNode to lazily persist the file to disk. Forces        
         replication factor of 1. This flag will result in reduced
         durability. Use with care.
                                                        
  -d  Skip creation of temporary file(<dst>._COPYING_). 

-renameSnapshot <snapshotDir> <oldName> <newName> :
  Rename a snapshot from oldName to newName

-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ... :
  Delete all files that match the specified file pattern. Equivalent to the Unix
  command "rm <src>"
                                                                                 
  -f          If the file does not exist, do not display a diagnostic message or 
              modify the exit status to reflect an error.                        
  -[rR]       Recursively deletes directories.                                   
  -skipTrash  option bypasses trash, if enabled, and immediately deletes <src>.  
  -safely     option requires safety confirmation, if enabled, requires          
              confirmation before deleting large directory with more than        
              <hadoop.shell.delete.limit.num.files> files. Delay is expected when
              walking over large directory recursively to count the number of    
              files to be deleted before the confirmation.                       

-rmdir [--ignore-fail-on-non-empty] <dir> ... :
  Removes the directory entry specified by each directory argument, provided it is
  empty.

-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>] :
  Sets Access Control Lists (ACLs) of files and directories.
  Options:
                                                                                 
  -b          Remove all but the base ACL entries. The entries for user, group   
              and others are retained for compatibility with permission bits.    
  -k          Remove the default ACL.                                            
  -R          Apply operations to all files and directories recursively.         
  -m          Modify ACL. New entries are added to the ACL, and existing entries 
              are retained.                                                      
  -x          Remove specified ACL entries. Other ACL entries are retained.      
  --set       Fully replace the ACL, discarding all existing entries. The        
              <acl_spec> must include entries for user, group, and others for    
              compatibility with permission bits.                                
  <acl_spec>  Comma separated list of ACL entries.                               
  <path>      File or directory to modify.                                       

-setfattr {-n name [-v value] | -x name} <path> :
  Sets an extended attribute name and value for a file or directory.
                                                                                 
  -n name   The extended attribute name.                                         
  -v value  The extended attribute value. There are three different encoding     
            methods for the value. If the argument is enclosed in double quotes, 
            then the value is the string inside the quotes. If the argument is   
            prefixed with 0x or 0X, then it is taken as a hexadecimal number. If 
            the argument begins with 0s or 0S, then it is taken as a base64      
            encoding.                                                            
  -x name   Remove the extended attribute.                                       
  <path>    The file or directory.                                               

-setrep [-R] [-w] <rep> <path> ... :
  Set the replication level of a file. If <path> is a directory then the command
  recursively changes the replication factor of all files under the directory tree
  rooted at <path>. The EC files will be ignored here.
                                                                                 
  -w  It requests that the command waits for the replication to complete. This   
      can potentially take a very long time.                                     
  -R  It is accepted for backwards compatibility. It has no effect.              

-stat [format] <path> ... :
  Print statistics about the file/directory at <path>
  in the specified format. Format accepts permissions in
  octal (%a) and symbolic (%A), filesize in
  bytes (%b), type (%F), group name of owner (%g),
  name (%n), block size (%o), replication (%r), user name
  of owner (%u), access date (%x, %X).
  modification date (%y, %Y).
  %x and %y show UTC date as "yyyy-MM-dd HH:mm:ss" and
  %X and %Y show milliseconds since January 1, 1970 UTC.
  If the format is not specified, %y is used by default.

-tail [-f] [-s <sleep interval>] <file> :
  Show the last 1KB of the file.
                                                                               
  -f  Shows appended data as the file grows.                                   
  -s  With -f , defines the sleep interval between iterations in milliseconds. 

-test -[defsz] <path> :
  Answer various questions about <path>, with result via exit status.
    -d  return 0 if <path> is a directory.
    -e  return 0 if <path> exists.
    -f  return 0 if <path> is a file.
    -s  return 0 if file <path> is greater         than zero bytes in size.
    -w  return 0 if file <path> exists         and write permission is granted.
    -r  return 0 if file <path> exists         and read permission is granted.
    -z  return 0 if file <path> is         zero bytes in size, else return 1.

-text [-ignoreCrc] <src> ... :
  Takes a source file and outputs the file in text format.
  The allowed formats are zip and TextRecordInputStream and Avro.

-touch [-a] [-m] [-t TIMESTAMP ] [-c] <path> ... :
  Updates the access and modification times of the file specified by the <path> to
  the current time. If the file does not exist, then a zero length file is created
  at <path> with current time as the timestamp of that <path>.
  -a Change only the access time 
  -m Change only the modification time 
  -t TIMESTAMP Use specified timestamp (in format yyyyMMddHHmmss) instead of
  current time 
  -c Do not create any files

-touchz <path> ... :
  Creates a file of zero length at <path> with current time as the timestamp of
  that <path>. An error is returned if the file exists with non-zero length

-truncate [-w] <length> <path> ... :
  Truncate all files that match the specified file pattern to the specified
  length.
                                                                                 
  -w  Requests that the command wait for block recovery to complete, if          
      necessary.                                                                 

-usage [cmd ...] :
  Displays the usage for given command or all commands if none is specified.

Generic options supported are:
-conf <configuration file>        specify an application configuration file
-D <property=value>               define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port>  specify a ResourceManager
-files <file1,...>                specify a comma-separated list of files to be copied to the map reduce cluster
-libjars <jar1,...>               specify a comma-separated list of jar files to be included in the classpath
-archives <archive1,...>          specify a comma-separated list of archives to be unarchived on the compute machines

The general command line syntax is:
command [genericOptions] [commandOptions]

In [8]:
hadoop fs -ls /
Found 37 items
-rw-r-----   2 devel     hadoop          2 2020-10-04 15:16 /1.txt
drwxr-x--x   - devel     hadoop          0 2020-09-27 15:57 /2019210718
-rw-r-----   2 devel     hadoop          6 2020-09-27 15:21 /2019210718gly
-rw-r-----   2 devel     hadoop          0 2020-09-26 22:37 /2020211004chenxi
drwxr-x--x   - devel     hadoop          0 2020-10-24 11:49 /211019lqh
-rw-r-----   2 devel     hadoop          0 2020-09-26 23:15 /23.15.txt
drwxr-x--x   - devel     hadoop          0 2020-09-27 18:08 /aaa
drwxr-x--x   - hadoop    hadoop          0 2020-09-01 16:44 /apps
drwxr-xr-x   - devel     hadoop          0 2020-09-24 21:43 /devel
drwxr-x--x   - devel     hadoop          0 2020-10-13 20:55 /dyyput
drwxrwxrwx   - flowagent hadoop          0 2020-09-01 16:43 /emr-flow
drwxr-x--x   - root      hadoop          0 2020-09-01 16:44 /emr-sparksql-udf
-rw-r-----   2 devel     hadoop       2511 2020-10-14 17:06 /estella
drwxr-x--x   - devel     hadoop          0 2020-10-13 20:38 /gyyput
drwxr-x--x   - devel     hadoop          0 2020-10-10 17:09 /home
drwxr-x--x   - devel     hadoop          0 2020-10-11 22:04 /input
drwxr-x--x   - devel     hadoop          0 2020-09-28 15:35 /lqhoutput1
drwxr-x--x   - devel     hadoop          0 2020-10-12 10:38 /lqhoutput2
drwxr-x--x   - devel     hadoop          0 2020-10-12 10:05 /lqhoutput3
drwxr-x--x   - devel     hadoop          0 2020-10-17 14:24 /output
-rw-r-----   2 devel     hadoop         16 2020-09-26 22:16 /panjinlian.txt
drwxr-x--x   - devel     hadoop          0 2020-09-27 18:17 /qian
-rw-r-----   2 devel     hadoop         20 2020-09-27 14:10 /qq
-rw-r-----   2 devel     hadoop        151 2020-10-09 21:22 /ren
-rw-r-----   2 devel     hadoop        151 2020-10-09 21:42 /ren.txt
drwxr-x--x   - devel     hadoop          0 2020-09-26 22:17 /sanguo
drwxr-x--x   - hadoop    hadoop          0 2020-10-08 17:23 /spark-history
drwxr-x--x   - devel     hadoop          0 2020-10-18 21:13 /students
-rw-r-----   2 devel     hadoop          0 2020-09-24 20:20 /test
drwxrwxrwx   - root      hadoop          0 2020-10-24 14:05 /tmp
drwxr-x--t   - hadoop    hadoop          0 2020-10-12 21:06 /user
drwxr-x--x   - devel     hadoop          0 2020-10-08 17:36 /usr
drwxr-x--x   - devel     hadoop          0 2020-10-13 21:00 /uyput
drwxr-x--x   - devel     hadoop          0 2020-09-27 20:28 /whnoutput
-rw-r-----   2 devel     hadoop         43 2020-10-09 21:47 /yy.txt
drwxr-x--x   - devel     hadoop          0 2020-10-09 22:07 /yyput
drwxr-x--x   - devel     hadoop          0 2020-10-13 20:50 /zyyput
In [17]:
hadoop fs -ls /user/
Found 12 items
drwxr-x--x   - devel  hadoop          0 2020-09-28 10:17 /user/111
drwxr-x--x   - devel  hadoop          0 2020-09-27 18:27 /user/aaa
drwxr-x--x   - devel  hadoop          0 2020-09-24 18:14 /user/de02
drwxr-x--x   - devel  hadoop          0 2020-10-21 14:23 /user/devel
drwxr-x--x   - devel  hadoop          0 2020-10-14 23:00 /user/estella
drwxr-x--x   - hadoop hadoop          0 2020-10-11 16:08 /user/hadoop
drwxr-x--x   - hadoop hadoop          0 2020-09-01 16:44 /user/hive
drwxr-x--x   - devel  hadoop          0 2020-09-27 18:21 /user/lf
drwxr-x--x   - devel  hadoop          0 2020-09-24 18:01 /user/lzl
drwxr-x--x   - root   hadoop          0 2020-09-01 16:44 /user/root
drwxr-x--x   - yanfei hadoop          0 2020-10-24 14:21 /user/yanfei
drwxr-x--x   - devel  hadoop          0 2020-10-19 23:20 /user/zhaonan
In [21]:
hadoop fs -put /home/yanfei/lectures/BDE-L7-hadoop.ipynb .
20/10/24 14:23:45 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
In [23]:
hadoop fs -ls /user/yanfei
Found 1 items
-rw-r-----   2 yanfei hadoop      52945 2020-10-24 14:23 /user/yanfei/BDE-L7-hadoop.ipynb
In [27]:
hadoop fs -mv BDE-L7-hadoop.ipynb hadoop.ipynb
In [28]:
hadoop fs -ls /user/yanfei
Found 1 items
-rw-r-----   2 yanfei hadoop      52945 2020-10-24 14:23 /user/yanfei/hadoop.ipynb
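
A hedged follow-up on the same file (the local destination path is illustrative): fetch it back from HDFS and check how much space it occupies.
In [ ]:
# Copy the file from HDFS back to the local file system.
hadoop fs -get /user/yanfei/hadoop.ipynb /tmp/hadoop-copy.ipynb
# Show its size in HDFS in human-readable form (logical size and space consumed).
hadoop fs -du -h /user/yanfei/hadoop.ipynb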

Hadoop MapReduce

  • Google published a paper on its MapReduce technology in December 2004; this became the genesis of the Hadoop processing model.
  • MapReduce is a programming model that allows us to perform parallel and distributed processing on huge data sets.

Traditional way

  • Critical path problem: the total time taken to finish the job without delaying the next milestone or the actual completion date. If any one machine delays its part of the job, the whole job is delayed.
  • Reliability problem: what if one of the machines working on a part of the data fails? Managing this failover becomes a challenge.
  • Equal split issue: how do we divide the data into smaller chunks so that each machine gets an even share of the work, i.e., so that no individual machine is overloaded or underutilized?
  • Single split may fail: if any machine fails to produce its output, the overall result cannot be computed, so the system needs a mechanism to ensure fault tolerance.
  • Aggregation of results: there must be a mechanism to aggregate the results generated by each machine into the final output.

MapReduce

MapReduce gives you the flexibility to write your processing logic without having to worry about the design issues of the underlying distributed system.

MapReduce

MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets across a cluster.

  • MapReduce consists of two distinct tasks – Map and Reduce.
  • As the name MapReduce suggests, the reduce phase takes place after the map phase has completed.
  • First comes the map job, where a block of data is read and processed to produce key-value pairs as intermediate output.
  • The output of a mapper, or map job (the key-value pairs), is the input to the reducer.
  • The reducer receives key-value pairs from multiple map jobs.
  • The reducer then aggregates those intermediate key-value pairs into a smaller set of pairs, which forms the final output; a toy local illustration of this map -> shuffle -> reduce flow follows this list.
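
A toy local illustration of the map -> shuffle -> reduce flow using only shell tools (no Hadoop involved; input.txt stands for any local text file):
In [ ]:
# "map": emit one word per line, so each word acts as a key.
# "shuffle": sort brings identical keys together.
# "reduce": count each group of identical keys; show the ten most frequent words.
tr -s '[:space:]' '\n' < input.txt | sort | uniq -c | sort -rn | head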

MapReduce example - word count

In [4]:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.1.3.jar \
    -input /user/yanfei/hadoop.ipynb \
    -output /user/yanfei/output \
    -mapper "/usr/bin/cat" \
    -reducer "/usr/bin/wc"
packageJobJar: [/tmp/hadoop-unjar8969007596139248477/] [] /tmp/streamjob2945745223499147347.jar tmpDir=null
20/10/25 19:15:23 INFO client.RMProxy: Connecting to ResourceManager at emr-header-1.cluster-47977/192.168.0.2:8032
20/10/25 19:15:23 INFO client.AHSProxy: Connecting to Application History server at emr-header-1.cluster-47977/192.168.0.2:10200
20/10/25 19:15:23 INFO client.RMProxy: Connecting to ResourceManager at emr-header-1.cluster-47977/192.168.0.2:8032
20/10/25 19:15:23 INFO client.AHSProxy: Connecting to Application History server at emr-header-1.cluster-47977/192.168.0.2:10200
20/10/25 19:15:23 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/yanfei/.staging/job_1603624118298_0007
20/10/25 19:15:23 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
20/10/25 19:15:23 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
20/10/25 19:15:23 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 97184efe294f64a51a4c5c172cbc22146103da53]
20/10/25 19:15:23 INFO mapred.FileInputFormat: Total input files to process : 1
20/10/25 19:15:23 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
20/10/25 19:15:23 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
20/10/25 19:15:23 INFO mapreduce.JobSubmitter: number of splits:16
20/10/25 19:15:23 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
20/10/25 19:15:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1603624118298_0007
20/10/25 19:15:23 INFO mapreduce.JobSubmitter: Executing with tokens: []
20/10/25 19:15:23 INFO conf.Configuration: resource-types.xml not found
20/10/25 19:15:23 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/10/25 19:15:24 INFO impl.YarnClientImpl: Submitted application application_1603624118298_0007
20/10/25 19:15:24 INFO mapreduce.Job: The url to track the job: http://emr-header-1.cluster-47977:20888/proxy/application_1603624118298_0007/
20/10/25 19:15:24 INFO mapreduce.Job: Running job: job_1603624118298_0007
20/10/25 19:15:29 INFO mapreduce.Job: Job job_1603624118298_0007 running in uber mode : false
20/10/25 19:15:29 INFO mapreduce.Job:  map 0% reduce 0%
20/10/25 19:15:33 INFO mapreduce.Job:  map 100% reduce 0%
20/10/25 19:15:37 INFO mapreduce.Job:  map 100% reduce 29%
20/10/25 19:15:38 INFO mapreduce.Job:  map 100% reduce 57%
20/10/25 19:15:39 INFO mapreduce.Job:  map 100% reduce 86%
20/10/25 19:15:40 INFO mapreduce.Job:  map 100% reduce 100%
20/10/25 19:15:40 INFO mapreduce.Job: Job job_1603624118298_0007 completed successfully
20/10/25 19:15:40 INFO mapreduce.Job: Counters: 54
	File System Counters
		FILE: Number of bytes read=19186
		FILE: Number of bytes written=5413892
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=115439
		HDFS: Number of bytes written=175
		HDFS: Number of read operations=83
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=14
	Job Counters 
		Killed map tasks=1
		Launched map tasks=16
		Launched reduce tasks=7
		Data-local map tasks=16
		Total time spent by all maps in occupied slots (ms)=2644533
		Total time spent by all reduces in occupied slots (ms)=1438746
		Total time spent by all map tasks (ms)=43353
		Total time spent by all reduce tasks (ms)=11793
		Total vcore-milliseconds taken by all map tasks=43353
		Total vcore-milliseconds taken by all reduce tasks=11793
		Total megabyte-milliseconds taken by all map tasks=84625056
		Total megabyte-milliseconds taken by all reduce tasks=46039872
	Map-Reduce Framework
		Map input records=1196
		Map output records=1196
		Map output bytes=54172
		Map output materialized bytes=30745
		Input split bytes=1840
		Combine input records=0
		Combine output records=0
		Reduce input groups=628
		Reduce shuffle bytes=30745
		Reduce input records=1196
		Reduce output records=7
		Spilled Records=2392
		Shuffled Maps =112
		Failed Shuffles=0
		Merged Map outputs=112
		GC time elapsed (ms)=1433
		CPU time spent (ms)=15750
		Physical memory (bytes) snapshot=10207895552
		Virtual memory (bytes) snapshot=94894026752
		Total committed heap usage (bytes)=29416226816
		Peak Map Physical memory (bytes)=497131520
		Peak Map Virtual memory (bytes)=3606556672
		Peak Reduce Physical memory (bytes)=335630336
		Peak Reduce Virtual memory (bytes)=5317402624
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=113599
	File Output Format Counters 
		Bytes Written=175
20/10/25 19:15:40 INFO streaming.StreamJob: Output directory: /user/yanfei/output
In [5]:
hadoop fs -ls /user/yanfei/output
Found 8 items
-rw-r-----   2 yanfei hadoop          0 2020-10-25 19:15 /user/yanfei/output/_SUCCESS
-rw-r-----   2 yanfei hadoop         25 2020-10-25 19:15 /user/yanfei/output/part-00000
-rw-r-----   2 yanfei hadoop         25 2020-10-25 19:15 /user/yanfei/output/part-00001
-rw-r-----   2 yanfei hadoop         25 2020-10-25 19:15 /user/yanfei/output/part-00002
-rw-r-----   2 yanfei hadoop         25 2020-10-25 19:15 /user/yanfei/output/part-00003
-rw-r-----   2 yanfei hadoop         25 2020-10-25 19:15 /user/yanfei/output/part-00004
-rw-r-----   2 yanfei hadoop         25 2020-10-25 19:15 /user/yanfei/output/part-00005
-rw-r-----   2 yanfei hadoop         25 2020-10-25 19:15 /user/yanfei/output/part-00006
In [8]:
hadoop fs -cat /user/yanfei/output/*
20/10/25 19:27:28 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
    202     973    7959	
20/10/25 19:27:29 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
    217     708    6661	
    197    1085   10139	
    109     737    5612	
    189    1019    9124	
    140     976    7797	
    142     823    6849	

MapReduce example - word count

In [ ]:
hadoop fs -rm -r /user/yanfei/output
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.1.3.jar \
    -input /user/yanfei/hadoop.ipynb \
    -output /user/yanfei/output \
    -mapper "/usr/bin/cat" \
    -reducer "/usr/bin/wc" \
    -numReduceTasks 1
rm: `/user/yanfei/output': No such file or directory
packageJobJar: [/tmp/hadoop-unjar6744909520225807933/] [] /tmp/streamjob2784735596554188573.jar tmpDir=null
20/11/02 10:48:20 INFO client.RMProxy: Connecting to ResourceManager at emr-header-1.cluster-47977/192.168.0.2:8032
20/11/02 10:48:20 INFO client.AHSProxy: Connecting to Application History server at emr-header-1.cluster-47977/192.168.0.2:10200
20/11/02 10:48:20 INFO client.RMProxy: Connecting to ResourceManager at emr-header-1.cluster-47977/192.168.0.2:8032
20/11/02 10:48:20 INFO client.AHSProxy: Connecting to Application History server at emr-header-1.cluster-47977/192.168.0.2:10200
20/11/02 10:48:20 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/yanfei/.staging/job_1603624118298_0248
20/11/02 10:48:20 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
20/11/02 10:48:20 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
20/11/02 10:48:20 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 97184efe294f64a51a4c5c172cbc22146103da53]
20/11/02 10:48:20 INFO mapred.FileInputFormat: Total input files to process : 1
20/11/02 10:48:20 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
20/11/02 10:48:20 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
20/11/02 10:48:20 INFO mapreduce.JobSubmitter: number of splits:16
20/11/02 10:48:20 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
20/11/02 10:48:20 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1603624118298_0248
20/11/02 10:48:20 INFO mapreduce.JobSubmitter: Executing with tokens: []
20/11/02 10:48:20 INFO conf.Configuration: resource-types.xml not found
20/11/02 10:48:20 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/11/02 10:48:21 INFO impl.YarnClientImpl: Submitted application application_1603624118298_0248
20/11/02 10:48:21 INFO mapreduce.Job: The url to track the job: http://emr-header-1.cluster-47977:20888/proxy/application_1603624118298_0248/
20/11/02 10:48:21 INFO mapreduce.Job: Running job: job_1603624118298_0248
In [12]:
hadoop fs -cat /user/yanfei/output/*
20/10/25 19:29:17 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
   1196    6321   54141
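
The cat/wc job above reports line, word, and byte totals per reducer rather than per-word frequencies. Below is a hedged sketch of a true per-word count with Hadoop Streaming; mapper.sh and reducer.sh are hypothetical helper scripts written only for this illustration, and because the streaming framework sorts the mapper output by key before the reducer sees it, uniq -c can count each word.
In [ ]:
# Write the helper scripts locally (hypothetical names used for this sketch).
cat > mapper.sh <<'EOF'
#!/bin/bash
# Emit one word per line; each line becomes a key for the shuffle phase.
tr -s '[:space:]' '\n' | grep -v '^$'
EOF
cat > reducer.sh <<'EOF'
#!/bin/bash
# Input arrives sorted by key, so uniq -c yields one count per distinct word.
uniq -c
EOF
chmod +x mapper.sh reducer.sh

# Ship the scripts with the job and run the streaming word count
# (the rm just clears the output directory if it already exists).
hadoop fs -rm -r /user/yanfei/wordcount
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.1.3.jar \
    -files mapper.sh,reducer.sh \
    -input /user/yanfei/hadoop.ipynb \
    -output /user/yanfei/wordcount \
    -mapper mapper.sh \
    -reducer reducer.sh \
    -numReduceTasks 1

# Inspect the most frequent words.
hadoop fs -cat /user/yanfei/wordcount/part-00000 | sort -rn | head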