What is Hive¶
- Hive is a data warehousing tool built on top of the Hadoop Distributed File System (HDFS)
- Hive makes it easy to perform operations such as
- Data encapsulation
- Ad-hoc queries
- Analysis of huge datasets
- Hive provides a mechanism to project structure onto the data and perform queries written in HQL (Hive Query Language)
- Queries written in HQL are converted to MapReduce jobs by the Hive compiler
- Targeted towards users who are comfortable with SQL
- Hive supports Data Definition Language (DDL), Data Manipulation Language (DML) and User Defined Functions (UDF)
Why Hive?¶
Many of the low-level details of writing MapReduce programs are quite repetitive from one job to the next, from chores like wiring together Mappers and Reducers to common data manipulation constructs, like filtering for just the data you want and performing SQL-like joins on data sets.
Hive not only provides a familiar programming model for people who know SQL, it also eliminates lots of boilerplate and sometimes-tricky coding you would have to do in Hadoop.
How does Hive work?¶
When MapReduce jobs are required, Hive doesn't generate Java MapReduce programs.
Instead, it uses built-in, generic Mapper and Reducer modules that are driven by an XML file representing the job plan.
In other words, these generic modules function like mini language interpreters and the language to drive the computation is encoded in XML.
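The plan-driven design can be sketched as a toy analogy in Python (this is an illustration only, not Hive's actual code): the "plan" is plain data, and a single generic mapper interprets it, much as Hive's built-in operators are driven by the compiled job plan.

```python
# Toy analogy (not real Hive code): a generic mapper is parameterized
# by a declarative "plan" instead of being generated per query.

def generic_mapper(record, plan):
    """Apply the plan's filter, then project the requested columns."""
    if plan["where"](record):
        return tuple(record[c] for c in plan["select"])
    return None

# A plan roughly equivalent to:
#   SELECT symbol, price_close FROM stocks WHERE price_close > 50
plan = {
    "where": lambda r: r["price_close"] > 50,
    "select": ["symbol", "price_close"],
}

# Made-up sample records.
rows = [
    {"symbol": "AAPL", "price_close": 74.5},
    {"symbol": "IBM", "price_close": 31.2},
]
out = [m for r in rows if (m := generic_mapper(r, plan)) is not None]
print(out)  # [('AAPL', 74.5)]
```

In real Hive the plan is far richer (joins, aggregations, multiple stages), but the principle is the same: generic operators, driven by data, replace hand-written job code.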
How to run HQL?¶
- Hive batch
- Interactive Hive
Hive Batch¶
Run Hive commands from the terminal:
hive -e "SHOW DATABASES;"
Run Hive scripts from the terminal:
hive -f /path/to/file/withqueries.hql
Hive Interactive¶
Start Hive from a Terminal:
hive
Execute an HDFS command within Hive:
dfs -ls /;
Exit Hive:
exit;
DDL¶
DDL: databases¶
SHOW DATABASES;
CREATE DATABASE IF NOT EXISTS myname;
DROP DATABASE IF EXISTS myname;
CREATE DATABASE IF NOT EXISTS myname LOCATION '/user/yanfei/hive';
LOCATION here is the HDFS path where the data for the database will be stored. The default is /user/hive/warehouse/.
SHOW DATABASES;
SHOW DATABASES LIKE "d*";
USE myname;
DDL: CREATE table¶
CREATE TABLE IF NOT EXISTS yanfeikang.employees (
  name STRING COMMENT 'Employee name',
  salary FLOAT COMMENT 'Employee salary')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00');
DDL: ALTER¶
ALTER TABLE test_table_1 RENAME TO new_table;
ALTER TABLE test_table_1 ADD COLUMNS (stock_name string);
DDL: DROP¶
DROP TABLE test_table_1;
DML¶
DML: create an external table¶
- Assume we have a data file stocks.txt located at /home/yanfei/lectures/data/stocks.txt; we can connect it to Hive as an external table.
cat code/L9/stocks/stocks.hql
use yanfeikang;

create external table if not exists stocks (
  symbol string,
  ymd string,
  price_open float,
  price_high float,
  price_low float,
  price_close float,
  volume int,
  price_adj_close float
)
row format delimited fields terminated by ','
LOCATION '/data/';

LOAD DATA LOCAL INPATH '/home/yanfei/lectures/data/stocks.txt' OVERWRITE INTO TABLE stocks;
DML: create an external table¶
'LOCAL' signifies that the input file is on the local file system. If 'LOCAL' is omitted then it looks for the file in HDFS.
For example, if the stocks data is on hdfs location '/data/stocks.txt', you can load data directly from hdfs:
LOAD DATA INPATH '/data/stocks.txt' OVERWRITE INTO TABLE stocks;
In this case, you are moving data from HDFS into LOCATION, which specifies the HDFS path where the data for the table will be stored. If the data you want to analyze with Hive already exists in HDFS, you can create a Hive table over that data by specifying the existing location. This avoids the need to copy or move the data.
Where does Hive store tables in HDFS?¶
Hive stores table files by default under /user/hive/warehouse on HDFS. In that location you can find a directory for each database you create, with subdirectories named after each table.
While creating Hive tables, you can also specify a custom location where to store them.
External/internal tables¶
For external tables, Hive stores the data at the HDFS LOCATION specified when the table is created (generally not in the warehouse directory). If the external table is dropped, the table metadata is deleted but the data is not. If you do not specify LOCATION when you create the table, the default location /user/hive/warehouse is used.
For internal tables, Hive stores data in its warehouse directory. If the table is dropped, both the table metadata and the data are deleted.
DML: queries¶
SELECT avg(price_close) FROM stocks WHERE symbol = 'AAPL';
SELECT * FROM stocks WHERE ymd = '2003-01-02';
SELECT symbol, avg(price_close) FROM stocks GROUP BY symbol;
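To see what these three queries compute, here is a small Python sketch that mirrors the WHERE filter and the GROUP BY average on a few made-up sample rows (the prices below are invented for illustration):

```python
from collections import defaultdict

# Made-up sample rows: (symbol, ymd, price_close).
rows = [
    ("AAPL", "2003-01-02", 14.80),
    ("AAPL", "2003-01-03", 14.90),
    ("IBM",  "2003-01-02", 78.50),
]

# SELECT avg(price_close) FROM stocks WHERE symbol = 'AAPL';
aapl = [close for sym, _, close in rows if sym == "AAPL"]
aapl_avg = sum(aapl) / len(aapl)

# SELECT * FROM stocks WHERE ymd = '2003-01-02';
jan2 = [r for r in rows if r[1] == "2003-01-02"]

# SELECT symbol, avg(price_close) FROM stocks GROUP BY symbol;
groups = defaultdict(list)
for sym, _, close in rows:
    groups[sym].append(close)
by_symbol = {sym: sum(v) / len(v) for sym, v in groups.items()}

print(aapl_avg, jan2, by_symbol)
```

Hive runs the same computation, but as one or more MapReduce jobs distributed over HDFS rather than in memory.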
Wordcount example using hive¶
cat code/L9/wordcount/wordcount.hql
use yanfeikang;

CREATE EXTERNAL TABLE IF NOT EXISTS fruits (line STRING);

LOAD DATA LOCAL INPATH '/home/yanfei/lectures/data/fruits' OVERWRITE INTO TABLE fruits;

CREATE TABLE IF NOT EXISTS fruit_counts AS
SELECT name, count(*) AS count
FROM (SELECT explode(split(line, ' ')) AS name FROM fruits) AS w
GROUP BY name
ORDER BY name;
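The explode(split(line, ' ')) step turns each line into one row per word; the whole query can be mirrored in plain Python (the sample lines below are made up) to see what it produces:

```python
from collections import Counter

# Made-up sample of the fruits table: one line of text per row.
lines = ["apple banana apple", "banana cherry"]

# explode(split(line, ' ')): split each line into words, one row per word.
words = [w for line in lines for w in line.split(" ")]

# GROUP BY name with count(*), ORDER BY name.
counts = sorted(Counter(words).items())
print(counts)  # [('apple', 2), ('banana', 2), ('cherry', 1)]
```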
Hive can also stream rows through an external script. The following Python mapper converts a Unix timestamp field into a weekday:
cat code/L9/movie/weekday_mapper.py
#!/usr/bin/env python3
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))
Use the mapper script:
cat code/L9/movie/wdtrans.hql
use yanfeikang;

CREATE EXTERNAL TABLE IF NOT EXISTS u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

CREATE EXTERNAL TABLE IF NOT EXISTS u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/home/yanfei/lectures/data/ml-100k/u.data' OVERWRITE INTO TABLE u_data;

ADD FILE hdfs:///user/yanfei/weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT TRANSFORM (userid, movieid, rating, unixtime)
USING 'python3 weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*) FROM u_data_new GROUP BY weekday;
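The TRANSFORM clause streams each row to the script's stdin as a tab-separated line and reads the transformed row back from stdout. The same per-row logic can be exercised directly in Python (the u.data row below is made up for illustration; the resulting weekday depends on the local timezone):

```python
import datetime

def map_row(line):
    # Same per-row logic as weekday_mapper.py:
    # replace the unixtime field with the ISO weekday (1=Monday .. 7=Sunday).
    userid, movieid, rating, unixtime = line.strip().split("\t")
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    return "\t".join([userid, movieid, rating, str(weekday)])

# A made-up u.data row: userid, movieid, rating, unix timestamp.
row = "196\t242\t3\t881250949"
print(map_row(row))
```

Hive itself does exactly this, once per row, in parallel across the cluster, which is why the script must read from stdin and write to stdout line by line.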