Objectives of this lecture¶
- Introduction to distributed computing
- Discovering Hadoop and why it’s so important
- Exploring the Hadoop Distributed File System
- Digging into Hadoop MapReduce
- Putting Hadoop to work
Why distributed systems?¶
What is Hadoop?¶
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
Explaining Hadoop¶
Hadoop was originally built by a Yahoo! engineer named Doug Cutting and is now an open source project managed by the Apache Software Foundation.
- Search engine innovators like Yahoo! and Google needed to find a way to make sense of the massive amounts of data that their engines were collecting.
- Hadoop was developed because it offered companies a pragmatic way to manage huge volumes of data.
- Hadoop allowed big problems to be broken down into smaller elements so that analysis could be done quickly and cost-effectively.
- By breaking a big data problem into small pieces that can be processed in parallel and then regrouping the partial results, Hadoop can analyze the whole data set and present a combined answer (see the example run after this list).
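To make this concrete, here is a minimal sketch of that split-and-regroup workflow using the WordCount job shipped with Hadoop's examples jar. The paths and the jar version below are assumptions for illustration; adjust them to your own installation, and note that the input data must already be sitting in HDFS (we revisit how to copy files there later in this lecture):

# run the bundled WordCount example: mappers emit (word, 1) pairs from each input split in parallel,
# and reducers regroup those pairs by word and sum the counts
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount /user/student/books /user/student/books-counts
# read the regrouped results back (one part file per reducer)
hadoop fs -cat /user/student/books-counts/part-r-00000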
Who uses Hadoop?¶
- Facebook uses Hadoop, Hive, and HBase for data warehousing and real-time application serving.
- Twitter uses Hadoop, Pig, and HBase for data analysis, visualization, social graph analysis, and machine learning.
- Yahoo! uses Hadoop for data analytics, machine learning, search ranking, email antispam, ad optimization...
- eBay, Samsung, Rackspace, J.P. Morgan, Groupon, LinkedIn, AOL, Last.fm...
Now let us try these commands:¶
hadoop
echo $HADOOP_HOME
echo $HADOOP_CONF_DIR
hadoop
Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class

  OPTIONS is none or any of:

buildpaths                       attempt to add class files from build tree
--config dir                     Hadoop config directory
--debug                          turn on shell script debug mode
--help                           usage information
hostnames list[,of,host,names]   hosts to use in slave mode
hosts filename                   list of hosts to use in slave mode
loglevel level                   set the log4j level for this command
workers                          turn on worker mode

  SUBCOMMAND is one of:

    Admin Commands:

daemonlog     get/set the log level for each daemon

    Client Commands:

archive       create a Hadoop archive
checknative   check native Hadoop and compression libraries availability
classpath     prints the class path needed to get the Hadoop jar and the required libraries
conftest      validate configuration XML files
credential    interact with credential providers
distch        distributed metadata changer
distcp        copy file or directories recursively
dtutil        operations related to delegation tokens
envvars       display computed Hadoop environment variables
fs            run a generic filesystem user client
gridmix       submit a mix of synthetic job, modeling a profiled from production load
jar <jar>     run a jar file. NOTE: please use "yarn jar" to launch YARN applications, not this command.
jnipath       prints the java.library.path
kdiag         Diagnose Kerberos Problems
kerbname      show auth_to_local principal conversion
key           manage keys via the KeyProvider
rumenfolder   scale a rumen input trace
rumentrace    convert logs into a rumen trace
s3guard       manage metadata on S3
trace         view and modify Hadoop tracing settings
version       print the version

    Daemon Commands:

kms           run KMS, the Key Management Server

SUBCOMMAND may print help when invoked w/o parameters or with -h.
hadoop version
Hadoop 3.2.1
Source code repository git@gitlab.alibaba-inc.com:soe/emr-hadoop.git -r 42f2ce4ee2a135d4523a6bbdb3c90e2fe6472b94
Compiled by root on 2024-08-07T11:31Z
Compiled with protoc 2.5.0
From source with checksum 1b543c4574cae11c43e3f6d84c15983d
This command was run using /opt/apps/HADOOP-COMMON/hadoop-3.2.1-1.2.16-alinux3/share/hadoop/common/hadoop-common-3.2.1.jar
echo $HADOOP_HOME
/opt/apps/HADOOP-COMMON/hadoop-common-current/
echo $JAVA_HOME
/usr/lib/jvm/java-1.8.0
echo $HADOOP_CONF_DIR
/etc/taihao-apps/hadoop-conf
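With these variables set, the fs subcommand from the usage listing above acts as a generic filesystem client for HDFS. A short walkthrough is sketched below; the /user/student paths and notes.txt file are placeholders, so substitute your own HDFS home directory and local file:

# list the root of the cluster's default filesystem
hadoop fs -ls /
# create a working directory and copy a local file into HDFS
hadoop fs -mkdir -p /user/student/demo
hadoop fs -put ./notes.txt /user/student/demo/
# read the file back from HDFS and show the space it occupies
hadoop fs -cat /user/student/demo/notes.txt
hadoop fs -du -h /user/student/demo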
Modules of Hadoop¶
- Hadoop Distributed File System (HDFS): A reliable, high-bandwidth, low-cost data storage cluster that facilitates the management of related files across machines.
- Hadoop MapReduce: A high-performance parallel/distributed data-processing implementation of the MapReduce programming model.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop Common: The common utilities that support the other Hadoop modules.
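On a running cluster, each of these modules can be inspected from the command line. A brief sketch follows; the exact output depends on your installation, and some commands may require appropriate permissions:

# HDFS: report live DataNodes and overall storage capacity
hdfs dfsadmin -report
# YARN: list the NodeManagers registered with the ResourceManager
yarn node -list
# MapReduce on YARN: show jobs currently known to the cluster
mapred job -list
# Hadoop Common: print the shared classpath used by the tools above
hadoop classpath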