Yanfei Kang
yanfeikang@buaa.edu.cn
School of Economics and Management
Beihang University
http://yanfei.site
Many of the low-level details of MapReduce jobs are repetitive from one job to the next, from chores like wiring together Mappers and Reducers to common data manipulation constructs, such as filtering for just the data you want and performing SQL-like joins on data sets.
Hive not only provides a familiar programming model for people who know SQL, it also eliminates lots of boilerplate and sometimes-tricky coding you would have to do in Hadoop.
When MapReduce jobs are required, Hive doesn't generate Java MapReduce programs.
Instead, it uses built-in, generic Mapper and Reducer modules that are driven by an XML file representing the job plan.
In other words, these generic modules function like mini language interpreters and the language to drive the computation is encoded in XML.
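To make the savings concrete, here is a sketch of a filter-plus-join in HiveQL; the orders and customers tables are hypothetical, but the equivalent hand-written Java MapReduce job would run to dozens of lines of code:
SELECT c.name, o.total
FROM orders o JOIN customers c ON o.customer_id = c.id
WHERE o.total > 100;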
Run Hive commands from the terminal: hive -e "SHOW DATABASES;"
Run Hive scripts from the terminal: hive -f /path/to/file/withqueries.hql
Start Hive from a terminal: hive
Execute HDFS commands within Hive: dfs -ls /;
Exit Hive: exit;
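Two useful variations (a sketch of common practice): -S runs Hive in silent mode so only query results reach stdout, and --hivevar defines a variable for substitution inside queries.
hive -S -e "SHOW TABLES;" > tables.txt
hive --hivevar db=myname -e 'USE ${hivevar:db}; SHOW TABLES;'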
SHOW DATABASES;
CREATE DATABASE IF NOT EXISTS myname;
DROP DATABASE IF EXISTS myname;
CREATE DATABASE IF NOT EXISTS myname LOCATION '/user/yanfei/hive';
SHOW DATABASES;
SHOW DATABASES LIKE "d*";
USE myname;
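To check a database's comment and where its files live in HDFS:
DESCRIBE DATABASE myname;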
CREATE TABLE IF NOT EXISTS yanfeikang.employees (
  name STRING COMMENT 'Employee name',
  salary FLOAT COMMENT 'Employee salary')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00');
SHOW TABLES;
DESCRIBE yanfeikang.employees;
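For fuller metadata (location, storage format, table properties), use the FORMATTED variant:
DESCRIBE FORMATTED yanfeikang.employees;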
ALTER TABLE test_table_1 RENAME TO new_table;
ALTER TABLE test_table_1 ADD COLUMNS (stock_name string);
DROP TABLE test_table_1;
Given stocks.txt located in HDFS at /user/yanfei, we can connect it to Hive as an external table:
cat code/L9/stocks/stocks.hql
use yanfeikang;

create external table if not exists stocks (
  symbol string,
  ymd string,
  price_open float,
  price_high float,
  price_low float,
  price_close float,
  volume int,
  price_adj_close float)
row format delimited fields terminated by ','
location '/user/yanfei/hive';

LOAD DATA LOCAL INPATH '/home/yanfei/lectures/data/stocks.txt' OVERWRITE INTO TABLE stocks;
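Alternatively, load from a path already in HDFS; note that without LOCAL, Hive moves the file into the table's location rather than copying it: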
LOAD DATA INPATH '/user/yanfei/stocks.txt' OVERWRITE INTO TABLE stocks;
SELECT avg(price_close) FROM stocks WHERE symbol = 'AAPL';
SELECT * FROM stocks WHERE ymd = '2003-01-02';
SELECT symbol, avg(price_close) FROM stocks GROUP BY symbol;
cat code/L9/wordcount/wordcount.hql
use yanfeikang;

CREATE EXTERNAL TABLE fruits (line STRING);

LOAD DATA INPATH '/user/yanfei/fruits' OVERWRITE INTO TABLE fruits;

CREATE TABLE IF NOT EXISTS fruit_counts AS
SELECT name, count(*) AS count
FROM (SELECT explode(split(line, ' ')) AS name FROM fruits) w
GROUP BY name
ORDER BY name;
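The inner query does the real work: split turns each line into an array of words, and the explode UDTF emits one row per array element. A quick way to inspect those intermediate rows:
SELECT explode(split(line, ' ')) AS name FROM fruits LIMIT 5;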
cat code/L9/movie/weekday_mapper.py
#!/usr/bin/env python3.8
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))
cat code/L9/movie/wdtrans.hql
use yanfeikang;

-- u_data must exist before loading; u.data (MovieLens) is tab-separated:
-- userid, movieid, rating, unixtime
CREATE TABLE IF NOT EXISTS u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/user/yanfei/u.data' OVERWRITE INTO TABLE u_data;

add FILE hdfs:///user/yanfei/weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*) FROM u_data_new GROUP BY weekday;
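Because TRANSFORM simply streams tab-separated rows through the script's stdin and stdout, the mapper can be sanity-checked outside Hive first (assuming a local copy of u.data):
head -5 u.data | python3 weekday_mapper.py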
Create an external table with Hive for the data file airdelay_small.csv.
Use Hive built-in functions to do basic statistics, as we did with Hadoop.
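A minimal sketch of the first step, assuming airdelay_small.csv is comma-separated, has been uploaded to a hypothetical HDFS directory /user/yanfei/airdelay, and that its leading columns include Year, Month, DayofMonth, DepDelay, and ArrDelay; check the actual header of the file before running:
use myname;

create external table if not exists airdelay (
  year int,
  month int,
  dayofmonth int,
  dep_delay float,
  arr_delay float)
row format delimited fields terminated by ','
location '/user/yanfei/airdelay'
tblproperties ('skip.header.line.count'='1');  -- drop this line if the file has no header row

select avg(arr_delay), min(arr_delay), max(arr_delay) from airdelay;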
External Reading: Capriolo, Edward, Dean Wampler, and Jason Rutherglen. Programming Hive: Data Warehouse and Query Language for Hadoop. O'Reilly Media, Inc., 2012.