Hive

What is Hive

  • Hive is Data warehousing tool developed on top of Hadoop Distributed File System (HDFS)
  • Hive makes job easy for performing operations like
    • Data encapsulation
    • Ad-hoc queries
    • Analysis of huge datasets
  • Hive provides a mechanism to project structure onto the data and perform queries written in HQL (Hive Query Language)
  • Queries or HQL gets converted to map reduce jobs by the Hive compiler
  • Targeted towards users who are comfortable with SQL
  • Hive supports Data Definition Language (DDL), Data Manipulation Language (DML) and User Defined Functions (UDF)

How to run HQL?

  • Run hive in worker-0.
  • hive -e "SHOW DATABSES".
  • hive -f myhql.hql.

DDL

DDL: databases

  • SHOW DATABASES;.
  • CREATE DATABASE IF NOT EXISTS myname;
  • DROP DATABASE IF EXISTS myname;
  • CREATE DATABASE IF NOT EXISTS myname LOCATION '/user/yanfeikang/hive';
  • SHOW DATABASES;.
  • SHOW DATABASES LIKE "d*";.
  • USE myname.

DDL: CREATE

CREATE TABLE IF NOT EXISTS myname.employees ( name STRING COMMENT 'Employee name', salary FLOAT COMMENT 'Employee salary', subordinates ARRAY<STRING> COMMENT 'Names of subordinates', deductions MAP<STRING, FLOAT> COMMENT 'Keys are deductions names, values are percentages', address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT> COMMENT 'Home address') COMMENT 'Description of the table' TBLPROPERTIES ('creator'='me', 'created_at'='2012-01-02 10:00:00');

SHOW TABLES; DESCRIBE myname.employees;

DDL: ALTER

ALTER TABLE test_table_1 RENAME TO new_table;

ALTER TABLE test_table_1 ADD COLUMNS (stock_name string);

DDL: DROP

DROP TABLE test_table_1;

DDL: SHOW

SHOW FUNCTIONS [like "str*"];

SHOW DATABASES [like "h*"];

SHOW TABLES [like "m*"];

DDL: DESCRIBE

DESCRIBE DATABASE bdg_financial;

DESCRIBE test_table_1;

DESCRIBE FUNCTION like;

DML

DML:data import

LOAD DATA LOCAL inpath '/user/yanfeikang/stocks.txt' INTO TABLE stocks;

DML: create an external table

create external table if not exists stocksext ( symbol string, ymd string, price_open float, price_high float, price_low float, price_close float, volume int, price_adj_close float ) row format delimited fields terminated by ',' location '/user/yanfeikang/hive';

LOAD DATA INPATH '/user/yanfeikang/stocks.txt' INTO TABLE stocksext;

DML: queries

  • SELECT avg(price_close) FROM stocksext WHERE symbol = 'AAPL';
  • SELECT * FROM stocksext WH ymd = '2003-01-02';
  • SELECT symbol, avg(price_close) FROM stocksext GROUP BY symbol;

Wordcount example using hive

CREATE TABLE news (line STRING); LOAD DATA INPATH '/user/yanfeikang/news.txt' OVERWRITE INTO TABLE news; CREATE TABLE word_counts AS SELECT word, count(1) AS count from (SELECT explode(split(line, '\\s')) AS word FROM news) w GROUP BY word ORDER BY word;

Transform: hive and python

When using hive, we can use python to process data. Now let's look at an example.

Python script:

import sys for line in sys.stdin: (employeeid,firstname,lastname) = line.split('\t') print(employeeid+'\t'+firstname+','+lastname)

Hive Script:

create table employees (employee_id int,first_name string,last_name string) stored as avro; insert into employees values(1,'donghua','luo'),(2,'larry','elison'),(3,'tom','kyte');

add file /tmp/employees.py; select transform(employee_id,first_name,last_name) using 'python employees.py' as (employee_id,full_name) from employees;