What is Hive¶
- Hive is a data warehousing tool built on top of the Hadoop Distributed File System (HDFS)
- Hive makes it easy to perform operations such as
- Data encapsulation
- Ad-hoc queries
- Analysis of huge datasets
- Hive provides a mechanism to project structure onto the data and perform queries written in HQL (Hive Query Language)
- Queries written in HQL are converted to MapReduce jobs by the Hive compiler
- Targeted towards users who are comfortable with SQL
- Hive supports Data Definition Language (DDL), Data Manipulation Language (DML) and User Defined Functions (UDF)
Why Hive?¶
Many of the low-level details of writing MapReduce programs are quite repetitive from one job to the next, from chores like wiring together Mappers and Reducers to common data manipulation constructs, like filtering for just the data you want and performing SQL-like joins on data sets.
Hive not only provides a familiar programming model for people who know SQL, it also eliminates lots of boilerplate and sometimes-tricky coding you would have to do in Hadoop.
How does Hive work?¶
When MapReduce jobs are required, Hive doesn't generate Java MapReduce programs.
Instead, it uses built-in, generic Mapper and Reducer modules that are driven by an XML file representing the job plan.
In other words, these generic modules function like mini language interpreters and the language to drive the computation is encoded in XML.
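The plan-driven design can be sketched as a toy analogy in Python (this is an illustration only, not Hive's actual code): the "plan" is plain data, and a single generic mapper interprets it, much as Hive's built-in operators are driven by the compiled job plan.

```python
# Toy analogy (not real Hive code): a generic mapper is parameterized
# by a declarative "plan" instead of being generated per query.

def generic_mapper(record, plan):
    """Apply the plan's filter, then project the requested columns."""
    if plan["where"](record):
        return tuple(record[c] for c in plan["select"])
    return None

# A plan roughly equivalent to:
#   SELECT symbol, price_close FROM stocks WHERE price_close > 50
plan = {
    "where": lambda r: r["price_close"] > 50,
    "select": ["symbol", "price_close"],
}

# Made-up sample records.
rows = [
    {"symbol": "AAPL", "price_close": 74.5},
    {"symbol": "IBM", "price_close": 31.2},
]
out = [m for r in rows if (m := generic_mapper(r, plan)) is not None]
print(out)  # [('AAPL', 74.5)]
```

In real Hive the plan is far richer (joins, aggregations, multiple stages), but the principle is the same: generic operators, driven by data, replace hand-written job code.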
How to run HQL?¶
- Hive batch
- Interactive Hive
Hive Batch¶
Run Hive commands from the terminal:
hive -e "SHOW DATABASES;"
Run Hive scripts from the terminal:
hive -f /path/to/file/withqueries.hql
Hive Interactive¶
Start Hive from a Terminal:
hive
Execute an HDFS command within Hive:
dfs -ls /;
Exit Hive:
exit;
DDL¶
DDL: databases¶
SHOW DATABASES;
CREATE DATABASE IF NOT EXISTS myname;
DROP DATABASE IF EXISTS myname;
CREATE DATABASE IF NOT EXISTS myname LOCATION '/user/yanfei/hive';
LOCATION here is the HDFS path where the data for the database will be stored. The default is /user/hive/warehouse/.
SHOW DATABASES;
SHOW DATABASES LIKE "d*";
USE myname;
DDL: CREATE table¶
CREATE TABLE IF NOT EXISTS yanfeikang.employees (
  name STRING COMMENT 'Employee name',
  salary FLOAT COMMENT 'Employee salary')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00');
DDL: ALTER¶
ALTER TABLE test_table_1 RENAME TO new_table;
ALTER TABLE test_table_1 ADD COLUMNS (stock_name string);
DDL: DROP¶
DROP TABLE test_table_1;
DML¶
DML: create an external table¶
- Assume we have a data file stocks.txt located at /home/yanfei/lectures/data/stocks.txt; we can connect it to Hive as an external table.
cat code/L9/stocks/stocks.hql
use yanfeikang;

create external table if not exists stocks (
  symbol string,
  ymd string,
  price_open float,
  price_high float,
  price_low float,
  price_close float,
  volume int,
  price_adj_close float
)
row format delimited fields terminated by ','
LOCATION '/data/';

LOAD DATA LOCAL INPATH '/home/yanfei/lectures/data/stocks.txt' OVERWRITE INTO TABLE stocks;
DML: create an external table¶
'LOCAL' signifies that the input file is on the local file system. If 'LOCAL' is omitted then it looks for the file in HDFS.
For example, if the stocks data is on hdfs location '/data/stocks.txt', you can load data directly from hdfs:
LOAD DATA INPATH '/data/stocks.txt' OVERWRITE INTO TABLE stocks;
In this case, you are moving data from HDFS into LOCATION, which specifies the HDFS path where the data for the table will be stored. If the data you want to analyze with Hive already exists in HDFS, you can create a Hive table over that data by specifying the existing location. This avoids the need to copy or move the data.
Where does Hive store tables in HDFS?¶
Hive stores table files by default under /user/hive/warehouse on HDFS. In that location you can find a directory for each database you create, with subdirectories named after each table.
While creating Hive tables, you can also specify a custom location where to store them.
External/internal tables¶
For external tables, Hive stores the data at the HDFS LOCATION specified when the table is created (generally not in the warehouse directory). If the external table is dropped, the table metadata is deleted but the data is not. If you do not specify LOCATION when you create the table, the default location /user/hive/warehouse is used.
For internal tables, Hive stores data in its warehouse directory. If the table is dropped, both the table metadata and the data are deleted.
DML: queries¶
SELECT avg(price_close) FROM stocks WHERE symbol = 'AAPL';
SELECT * FROM stocks WHERE ymd = '2003-01-02';
SELECT symbol, avg(price_close) FROM stocks GROUP BY symbol;
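To see what these three queries compute, here is a small Python sketch that mirrors the WHERE filter and the GROUP BY average on a few made-up sample rows (the prices below are invented for illustration):

```python
from collections import defaultdict

# Made-up sample rows: (symbol, ymd, price_close).
rows = [
    ("AAPL", "2003-01-02", 14.80),
    ("AAPL", "2003-01-03", 14.90),
    ("IBM",  "2003-01-02", 78.50),
]

# SELECT avg(price_close) FROM stocks WHERE symbol = 'AAPL';
aapl = [close for sym, _, close in rows if sym == "AAPL"]
aapl_avg = sum(aapl) / len(aapl)

# SELECT * FROM stocks WHERE ymd = '2003-01-02';
jan2 = [r for r in rows if r[1] == "2003-01-02"]

# SELECT symbol, avg(price_close) FROM stocks GROUP BY symbol;
groups = defaultdict(list)
for sym, _, close in rows:
    groups[sym].append(close)
by_symbol = {sym: sum(v) / len(v) for sym, v in groups.items()}

print(aapl_avg, jan2, by_symbol)
```

Hive runs the same computation, but as one or more MapReduce jobs distributed over HDFS rather than in memory.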
Wordcount example using hive¶
cat code/L9/wordcount/wordcount.hql
use yanfeikang;

CREATE EXTERNAL TABLE IF NOT EXISTS fruits (line STRING);

LOAD DATA LOCAL INPATH '/home/yanfei/lectures/data/fruits' OVERWRITE INTO TABLE fruits;

CREATE TABLE IF NOT EXISTS fruit_counts AS
SELECT name, count(*) AS count
FROM (SELECT explode(split(line, ' ')) AS name FROM fruits) AS w
GROUP BY name
ORDER BY name;
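The explode(split(line, ' ')) step turns each line into one row per word; the whole query can be mirrored in plain Python (the sample lines below are made up) to see what it produces:

```python
from collections import Counter

# Made-up sample of the fruits table: one line of text per row.
lines = ["apple banana apple", "banana cherry"]

# explode(split(line, ' ')): split each line into words, one row per word.
words = [w for line in lines for w in line.split(" ")]

# GROUP BY name with count(*), ORDER BY name.
counts = sorted(Counter(words).items())
print(counts)  # [('apple', 2), ('banana', 2), ('cherry', 1)]
```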
Hive can also stream rows through an external script. The following Python mapper converts a Unix timestamp field into a weekday:
cat code/L9/movie/weekday_mapper.py
#!/usr/bin/env python3
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))
Use the mapper script:
cat code/L9/movie/wdtrans.hql
use yanfeikang;

CREATE EXTERNAL TABLE IF NOT EXISTS u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

CREATE EXTERNAL TABLE IF NOT EXISTS u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/home/yanfei/lectures/data/ml-100k/u.data' OVERWRITE INTO TABLE u_data;

ADD FILE hdfs:///user/yanfei/weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT TRANSFORM (userid, movieid, rating, unixtime)
USING 'python3 weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*) FROM u_data_new GROUP BY weekday;
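The TRANSFORM clause streams each row to the script's stdin as a tab-separated line and reads the transformed row back from stdout. The same per-row logic can be exercised directly in Python (the u.data row below is made up for illustration; the resulting weekday depends on the local timezone):

```python
import datetime

def map_row(line):
    # Same per-row logic as weekday_mapper.py:
    # replace the unixtime field with the ISO weekday (1=Monday .. 7=Sunday).
    userid, movieid, rating, unixtime = line.strip().split("\t")
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    return "\t".join([userid, movieid, rating, str(weekday)])

# A made-up u.data row: userid, movieid, rating, unix timestamp.
row = "196\t242\t3\t881250949"
print(map_row(row))
```

Hive itself does exactly this, once per row, in parallel across the cluster, which is why the script must read from stdin and write to stdout line by line.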