Many of these low-level details are quite repetitive from one job to the next, from chores like wiring together Mappers and Reducers to common data-manipulation constructs, like filtering for just the data you want and performing SQL-like joins on data sets.
Hive not only provides a familiar programming model for people who know SQL, it also eliminates lots of boilerplate and sometimes-tricky coding you would have to do in Hadoop.
When MapReduce jobs are required, Hive doesn't generate Java MapReduce programs.
Instead, it uses built-in, generic Mapper and Reducer modules that are driven by an XML file representing the job plan.
In other words, these generic modules function like mini language interpreters and the language to drive the computation is encoded in XML.
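To make that idea concrete, here is a minimal Python sketch (not Hive's actual code) of a "generic" mapper whose behavior is driven entirely by a declarative job plan. The dictionary-based plan format below is a made-up stand-in for Hive's XML plan; only the interpreter-style architecture is the point.

```python
# Conceptual sketch: one generic mapper module, many "queries".
# The plan (a dict here, XML in Hive) says what to filter and project;
# the mapper just interprets it.

def make_generic_mapper(plan):
    """Build a mapper from a declarative plan: a filter predicate
    followed by a list of columns to project."""
    predicate = plan["filter"]      # e.g. keep rows with price > 10
    columns = plan["project"]       # e.g. emit only symbol and price

    def mapper(row):
        if predicate(row):
            yield tuple(row[c] for c in columns)

    return mapper

# The same generic module runs a different computation for each plan.
plan = {"filter": lambda r: r["price"] > 10, "project": ["symbol", "price"]}
mapper = make_generic_mapper(plan)
rows = [{"symbol": "AAPL", "price": 12.0}, {"symbol": "XYZ", "price": 5.0}]
out = [rec for row in rows for rec in mapper(row)]
# out == [("AAPL", 12.0)]
```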
CREATE DATABASE IF NOT EXISTS myname;
DROP DATABASE IF EXISTS myname;
CREATE DATABASE IF NOT EXISTS myname LOCATION '/user/yanfei/hive';
SHOW DATABASES LIKE 'd*';
CREATE TABLE IF NOT EXISTS yanfeikang.employees (
    name   STRING COMMENT 'Employee name',
    salary FLOAT  COMMENT 'Employee salary')
COMMENT 'Description of the table'
TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00');
ALTER TABLE test_table_1 RENAME TO new_table;
ALTER TABLE test_table_1 ADD COLUMNS (stock_name STRING);
With the file stocks.txt located in HDFS at /user/yanfei, we could connect it to Hive as an external table.
USE yanfeikang;

CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
    symbol STRING,
    ymd STRING,
    price_open FLOAT,
    price_high FLOAT,
    price_low FLOAT,
    price_close FLOAT,
    volume INT,
    price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/yanfei/hive';

LOAD DATA LOCAL INPATH '/home/yanfei/lectures/data/stocks.txt' OVERWRITE INTO TABLE stocks;
LOAD DATA INPATH '/user/yanfei/stocks.txt' OVERWRITE INTO TABLE stocks;
USE yanfeikang;

CREATE EXTERNAL TABLE fruits (line STRING);

LOAD DATA INPATH '/user/yanfei/fruits' OVERWRITE INTO TABLE fruits;

CREATE TABLE IF NOT EXISTS fruit_counts AS
SELECT name, count(*) AS count
FROM (SELECT explode(split(line, ' ')) AS name FROM fruits) w
GROUP BY name
ORDER BY name;
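To see what the nested query is doing, here is a rough Python equivalent of the same computation: `split(line, ' ')` tokenizes each line, `explode` flattens the resulting word arrays into one row per word, and the outer query groups, counts, and orders. The sample lines below are made up for illustration.

```python
# Python sketch of the HiveQL word count above (assumed sample data).
from collections import Counter

lines = ["apple banana apple", "banana cherry"]

# explode(split(line, ' ')): one output row per word
words = [w for line in lines for w in line.split(" ")]

# GROUP BY name with COUNT(*), then ORDER BY name
fruit_counts = sorted(Counter(words).items())
# fruit_counts == [("apple", 2), ("banana", 2), ("cherry", 1)]
```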
#!/usr/bin/env python3.8
import sys
import datetime

# Read tab-separated rows from stdin; replace the Unix timestamp
# with the day of the week (1 = Monday ... 7 = Sunday).
for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))
USE yanfeikang;

CREATE TABLE IF NOT EXISTS u_data (
    userid INT,
    movieid INT,
    rating INT,
    unixtime STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/user/yanfei/u.data' OVERWRITE INTO TABLE u_data;

CREATE TABLE u_data_new (
    userid INT,
    movieid INT,
    rating INT,
    weekday INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

ADD FILE hdfs:///user/yanfei/weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*) FROM u_data_new GROUP BY weekday;
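The TRANSFORM clause follows the Hadoop Streaming contract: Hive writes the selected columns to the script's stdin as tab-separated lines and parses tab-separated lines back from its stdout. The Python sketch below simulates that round trip in-process; the sample rows and timestamps are made up for illustration.

```python
# Simulate the SELECT TRANSFORM streaming contract in-process
# (assumed sample rows; Hive would pipe these through the script).
import datetime

def weekday_mapper(stdin_lines):
    """Same logic as weekday_mapper.py, as a generator over input lines."""
    for line in stdin_lines:
        userid, movieid, rating, unixtime = line.strip().split("\t")
        weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
        yield "\t".join([userid, movieid, rating, str(weekday)])

rows = ["196\t242\t3\t881250949", "186\t302\t3\t891717742"]
transformed = list(weekday_mapper(rows))
# Each output row keeps the first three columns and replaces the
# timestamp with a weekday number (1 = Monday ... 7 = Sunday).
```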
Create an external table with Hive for the data
Use Hive's built-in functions to do basic statistics, as we did with Hadoop.
External Reading: Capriolo, Edward, Dean Wampler, and Jason Rutherglen. Programming Hive: Data Warehouse and Query Language for Hadoop. O'Reilly Media, Inc., 2012.