Yanfei Kang
yanfeikang@buaa.edu.cn
School of Economics and Management
Beihang University
http://yanfei.site
Many of the low-level details of MapReduce jobs are repetitive from one job to the next, from chores like wiring together Mappers and Reducers to common data manipulation constructs, such as filtering for just the data you want and performing SQL-like joins on data sets.
Hive not only provides a familiar programming model for people who know SQL, it also eliminates lots of boilerplate and sometimes-tricky coding you would have to do in Hadoop.
When MapReduce jobs are required, Hive doesn't generate Java MapReduce programs.
Instead, it uses built-in, generic Mapper and Reducer modules that are driven by an XML file representing the job plan.
In other words, these generic modules function like mini language interpreters and the language to drive the computation is encoded in XML.
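To make the savings concrete, here is a sketch of a filter-plus-join in HiveQL; the orders and customers tables are hypothetical, but the equivalent hand-written Java MapReduce job would run to dozens of lines of code:
SELECT c.name, o.total
FROM orders o JOIN customers c ON o.customer_id = c.id
WHERE o.total > 100;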
Run Hive commands from the terminal: hive -e "SHOW DATABASES;"
Run Hive scripts from the terminal: hive -f /path/to/file/withqueries.hql
Start Hive from a terminal: hive
Execute HDFS commands within Hive: dfs -ls /;
Exit Hive: exit;
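Two useful variations (a sketch of common practice): -S runs Hive in silent mode so only query results reach stdout, and --hivevar defines a variable for substitution inside queries.
hive -S -e "SHOW TABLES;" > tables.txt
hive --hivevar db=myname -e 'USE ${hivevar:db}; SHOW TABLES;'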
SHOW DATABASES;
CREATE DATABASE IF NOT EXISTS myname;
DROP DATABASE IF EXISTS myname;
CREATE DATABASE IF NOT EXISTS myname LOCATION '/user/yanfei/hive';
SHOW DATABASES;
SHOW DATABASES LIKE "d*";
USE myname;
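To check a database's comment and where its files live in HDFS:
DESCRIBE DATABASE myname;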
CREATE TABLE IF NOT EXISTS yanfeikang.employees (
  name STRING COMMENT 'Employee name',
  salary FLOAT COMMENT 'Employee salary')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00');
SHOW TABLES;
DESCRIBE yanfeikang.employees;
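For fuller metadata (location, storage format, table properties), use the FORMATTED variant:
DESCRIBE FORMATTED yanfeikang.employees;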
ALTER TABLE test_table_1 RENAME TO new_table;
ALTER TABLE test_table_1 ADD COLUMNS (stock_name string);
DROP TABLE test_table_1;
Given stocks.txt located in HDFS at /user/yanfei, we can connect it to Hive as an external table:
cat code/L9/stocks/stocks.hql
use yanfeikang;

create external table if not exists stocks (
  symbol string,
  ymd string,
  price_open float,
  price_high float,
  price_low float,
  price_close float,
  volume int,
  price_adj_close float)
row format delimited fields terminated by ','
location '/user/yanfei/hive';

LOAD DATA LOCAL INPATH '/home/yanfei/lectures/data/stocks.txt' OVERWRITE INTO TABLE stocks;
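Alternatively, load from a path already in HDFS; note that without LOCAL, Hive moves the file into the table's location rather than copying it: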
LOAD DATA INPATH '/user/yanfei/stocks.txt' OVERWRITE INTO TABLE stocks;
SELECT avg(price_close) FROM stocks WHERE symbol = 'AAPL';
SELECT * FROM stocks WHERE ymd = '2003-01-02';
SELECT symbol, avg(price_close) FROM stocks GROUP BY symbol;
cat code/L9/wordcount/wordcount.hql
use yanfeikang;

CREATE EXTERNAL TABLE fruits (line STRING);

LOAD DATA INPATH '/user/yanfei/fruits' OVERWRITE INTO TABLE fruits;

CREATE TABLE IF NOT EXISTS fruit_counts AS
SELECT name, count(*) AS count
FROM (SELECT explode(split(line, ' ')) AS name FROM fruits) w
GROUP BY name
ORDER BY name;
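The inner query does the real work: split turns each line into an array of words, and the explode UDTF emits one row per array element. A quick way to inspect those intermediate rows:
SELECT explode(split(line, ' ')) AS name FROM fruits LIMIT 5;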
cat code/L9/movie/weekday_mapper.py
#!/usr/bin/env python3.8
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))
cat code/L9/movie/wdtrans.hql
use yanfeikang;

-- u_data must exist before loading; u.data (MovieLens) is tab-separated:
-- userid, movieid, rating, unixtime
CREATE TABLE IF NOT EXISTS u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/user/yanfei/u.data' OVERWRITE INTO TABLE u_data;

add FILE hdfs:///user/yanfei/weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*) FROM u_data_new GROUP BY weekday;
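Because TRANSFORM simply streams tab-separated rows through the script's stdin and stdout, the mapper can be sanity-checked outside Hive first (assuming a local copy of u.data):
head -5 u.data | python3 weekday_mapper.py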
Create an external table with Hive for the data file airdelay_small.csv.
Use Hive built-in functions to do basic statistics, as we did with Hadoop.
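A minimal sketch of the first step, assuming airdelay_small.csv is comma-separated, has been uploaded to a hypothetical HDFS directory /user/yanfei/airdelay, and that its leading columns include Year, Month, DayofMonth, DepDelay, and ArrDelay; check the actual header of the file before running:
use myname;

create external table if not exists airdelay (
  year int,
  month int,
  dayofmonth int,
  dep_delay float,
  arr_delay float)
row format delimited fields terminated by ','
location '/user/yanfei/airdelay'
tblproperties ('skip.header.line.count'='1');  -- drop this line if the file has no header row

select avg(arr_delay), min(arr_delay), max(arr_delay) from airdelay;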
External Reading: Capriolo, Edward, Dean Wampler, and Jason Rutherglen. Programming Hive: Data Warehouse and Query Language for Hadoop. O'Reilly Media, Inc., 2012.