Big Data Essentials

L9: Hive





Yanfei Kang
yanfeikang@buaa.edu.cn
School of Economics and Management
Beihang University
http://yanfei.site

What is Hive

  • Hive is a data warehousing tool built on top of the Hadoop Distributed File System (HDFS)
  • Hive makes it easy to perform operations such as
    • Data encapsulation
    • Ad-hoc queries
    • Analysis of huge datasets
  • Hive provides a mechanism to project structure onto the data and query it with HQL (Hive Query Language)
  • HQL queries are converted into MapReduce jobs by the Hive compiler
  • Hive is targeted at users who are comfortable with SQL
  • Hive supports Data Definition Language (DDL), Data Manipulation Language (DML), and User Defined Functions (UDFs); a first HQL sketch follows
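
As a first taste, a minimal HQL sketch looks just like SQL (the pages table and its columns are illustrative, not part of the lecture data):

-- create a simple table and query it with familiar SQL syntax
CREATE TABLE IF NOT EXISTS pages (url STRING, hits INT);

-- Hive compiles this aggregation into a MapReduce job behind the scenes
SELECT url, SUM(hits) AS total_hits FROM pages GROUP BY url;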

Why Hive?

  • Many of Hadoop's low-level details are repetitive from one job to the next: chores like wiring together Mappers and Reducers, and common data manipulation constructs such as filtering for just the data you want or performing SQL-like joins on data sets.

  • Hive not only provides a familiar programming model for people who know SQL, it also eliminates lots of boilerplate and sometimes-tricky coding you would otherwise have to do in Hadoop; the join sketch below is one example.
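
A SQL-like join that would take pages of MapReduce code is a single HQL statement (the emps and depts tables here are hypothetical):

-- hypothetical tables: Hive plans and runs the join for you
SELECT e.name, d.dept_name
FROM emps e JOIN depts d ON e.dept_id = d.dept_id
WHERE e.salary > 50000;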

How does Hive work?

  • When MapReduce jobs are required, Hive doesn't generate Java MapReduce programs.

  • Instead, it uses built-in, generic Mapper and Reducer modules that are driven by an XML file representing the job plan.

  • In other words, these generic modules function like mini language interpreters, and the language driving the computation is encoded in XML.
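
You can inspect the plan Hive produces with the EXPLAIN command; for instance, using the stocks table defined later in this lecture:

-- print the stages Hive will execute, without running the query
EXPLAIN SELECT symbol, avg(price_close) FROM stocks GROUP BY symbol;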

How to run HQL?

  • Hive batch
  • Interactive Hive

Hive Batch

  • Run Hive commands from the terminal: hive -e "SHOW DATABASES;"

  • Run Hive scripts from the terminal: hive -f /path/to/file/withqueries.hql
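
A script passed to hive -f simply contains ordinary HQL statements, one after another; a sketch of what withqueries.hql might hold (contents illustrative):

-- /path/to/file/withqueries.hql (illustrative contents)
SHOW DATABASES;
USE myname;
SHOW TABLES;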

Hive Interactive

  • Start Hive from a terminal: hive

  • Execute HDFS commands from within Hive: dfs -ls /;

  • Exit Hive: exit;
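
A short interactive session might look like this (output omitted; hive> is the prompt Hive prints):

hive
hive> SHOW DATABASES;
hive> dfs -ls /;
hive> exit;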

DDL

DDL: databases

  • SHOW DATABASES;
  • CREATE DATABASE IF NOT EXISTS myname;
  • DROP DATABASE IF EXISTS myname;
  • CREATE DATABASE IF NOT EXISTS myname LOCATION '/user/yanfei/hive';
  • SHOW DATABASES;
  • SHOW DATABASES LIKE "d*";
  • USE myname;

DDL: CREATE

CREATE TABLE IF NOT EXISTS yanfeikang.employees (
    name STRING COMMENT 'Employee name',
    salary FLOAT COMMENT 'Employee salary')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00');

SHOW TABLES;

DESCRIBE yanfeikang.employees;

DDL: ALTER

ALTER TABLE test_table_1 ADD COLUMNS (stock_name STRING);

ALTER TABLE test_table_1 RENAME TO new_table;
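
A single column can also be renamed or retyped with CHANGE COLUMN; a sketch continuing the example above:

ALTER TABLE new_table CHANGE COLUMN stock_name symbol_name STRING;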

DDL: DROP

DROP TABLE test_table_1;

DDL: SHOW

SHOW FUNCTIONS [like "str*"];

SHOW DATABASES [like "h*"];

SHOW TABLES [like "m*"];

DDL: DESCRIBE

DESCRIBE DATABASE myname;

DESCRIBE test_table_1;

DESCRIBE FUNCTION like;
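
For more detail than plain DESCRIBE, Hive also accepts the EXTENDED and FORMATTED keywords, e.g.:

DESCRIBE FORMATTED test_table_1;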

DML

DML: create an external table

  • Assume we have a data file stocks.txt located in HDFS at /user/yanfei; we can connect it to Hive as an external table.
cat code/L9/stocks/stocks.hql
use yanfeikang;

create external table if not exists stocks (
    symbol string, 
    ymd string, 
    price_open float, 
    price_high float, 
    price_low float, 
    price_close float, 
    volume int, 
    price_adj_close float )
row format delimited fields terminated by ',' 
location '/user/yanfei/hive';

LOAD DATA LOCAL INPATH '/home/yanfei/lectures/data/stocks.txt' OVERWRITE INTO TABLE stocks;
  • 'LOCAL' signifies that the input file is on the local file system. If 'LOCAL' is omitted, Hive looks for the file in HDFS:

LOAD DATA INPATH '/user/yanfei/stocks.txt' OVERWRITE INTO TABLE stocks;

DML: queries

  • SELECT avg(price_close) FROM stocks WHERE symbol = 'AAPL';
  • SELECT * FROM stocks WHERE ymd = '2003-01-02';
  • SELECT symbol, avg(price_close) FROM stocks GROUP BY symbol;
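
Aggregations can be filtered with HAVING as well; for example, on the same stocks table (the 50.0 threshold is arbitrary):

-- keep only symbols whose average closing price exceeds 50.0
SELECT symbol, avg(price_close) AS avg_close
FROM stocks
GROUP BY symbol
HAVING avg(price_close) > 50.0;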

Wordcount example using hive

cat code/L9/wordcount/wordcount.hql
use yanfeikang;

CREATE EXTERNAL TABLE fruits (line STRING);
LOAD DATA INPATH '/user/yanfei/fruits' OVERWRITE INTO TABLE fruits;
CREATE TABLE IF NOT EXISTS fruit_counts AS
    SELECT name, count(*) AS count from
        (SELECT explode(split(line, ' ')) AS name FROM fruits) w
    GROUP BY name
    ORDER BY name;
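
After the script runs, the word counts can be inspected with an ordinary query:

SELECT * FROM fruit_counts;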

Transform: hive and python

  • When using Hive, we can use Python to process data through the TRANSFORM clause. Let's look at an example.

  • We will use the MovieLens 100k data available at /user/yanfei/u.data. Read more about the data here.

  • We can do some analysis with the data.

cat code/L9/movie/weekday_mapper.py
#! /usr/bin/env python3.8

import sys
import datetime

for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  # convert the Unix timestamp to an ISO weekday (1 = Monday, ..., 7 = Sunday)
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print('\t'.join([userid, movieid, rating, str(weekday)]))
  • Use the mapper script:
cat code/L9/movie/wdtrans.hql
use yanfeikang;

CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

-- the raw ratings table must exist before we can load into it
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/user/yanfei/u.data'
OVERWRITE INTO TABLE u_data;

add FILE hdfs:///user/yanfei/weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;
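
The whole pipeline can then be run in batch mode using the hive -f form shown earlier:

hive -f code/L9/movie/wdtrans.hql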

Lab 5

  • Create an external table with Hive for the data file airdelay_small.csv.

  • Use Hive built-in functions to do basic statistics, as we did with Hadoop.

  • External Reading: Capriolo, Edward, Dean Wampler, and Jason Rutherglen. Programming Hive: Data Warehouse and Query Language for Hadoop. O'Reilly Media, Inc., 2012.