Hive Introduction
What is Hive?
- Hive is an open-source data warehouse solution originally developed by Facebook that operates on Hadoop infrastructure
- It provides SQL-like query capabilities (HQL) for structured data stored in HDFS
- Core functionality involves translating SQL queries into MapReduce jobs
- Primary use case: batch data analytics with higher development efficiency than raw MapReduce programming
Hive Architecture
System Components
-
User Interfaces: Multiple access methods including:
- CLI (Command Line Interface)
- JDBC/ODBC connectors for Java applications
- Web-based graphical interface
- HiveServer2 supporting Thrift protocol for remote client connections in various programming languages
-
Metadata Storage: Utilizes relational databases (MySQL, Derby) for storing:
- Table definitions and schemas
- Column specifications
- Partition information
- Table properties (external/internal designation)
- Data location mappings
-
Query Processing: Comprehensive pipeline including:
- Parser, compiler, optimizer, and executor components
- Query plan generation stored in HDFS
- Execution via MapReduce framework
Hive and Hadoop Integration
Hive leverages HDFS for data storage and MapReduce for query processing. As a data warehouse tool rather than a cluster service, Hive installation is typically confined to Hadoop master nodes.
Hive vs Traditional Databases
Hive specializes in large-scale batch data processing with key distinctions:
| Feature | Hive | RDBMS |
|---|---|---|
| SQL Compliance | Partial | Full ANSI SQL |
| Data Updates | INSERT OVERWRITE/INTO | UPDATE/INSERT/DELETE |
| Transactions | Not supported (default) | Supported |
| Schema Approach | Read-time validation | Write-time validation |
| Query Language | HQL | SQL |
| Storage | HDFS | Local/RAW devices |
| Execution Engine | MapReduce | Database Engine |
| Latency | High | Low |
| Subqueries | FROM clause only | Full support |
| Data Volume | Petabyte-scale | Smaller datasets |
| Scalability | High | Limited |
| Indexing | Bitmap indexes (v0.8+) | Complex indexing |
Data Types
- Primitive Types: TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP, DECIMAL, CHAR, VARCHAR, DATE
- Complex Types: ARRAY, MAP, STRUCT
Table Types
- Managed Tables (Internal Tables)
- External Tables
Hive Data Model
- All data resides in HDFS without predefined storage format
- Schema-on-read approach with custom delimiters
- Data model hierarchy:
- Database: HDFS directory under hive.metastore.warehouse.dir
- Table: Subdirectory within database
- External Table: Custom HDFS path location
- Partition: Table subdirectory for data segmentation
- Bucket: Hash-distributed files within table directory
Hive Installation
Prerequisites include JDK and Hadoop installation with proper environment configuraton.
Installation Steps
- Download and extract Hive distribution:
tar -xzf hive-1.1.0-cdh5.7.0.tar.gz -C ~/applications/
- Configure environment file:
cd hive-1.1.0-cdh5.7.0/conf/
cp hive-env.sh.template hive-env.sh
vi hive-env.sh
Add Hadoop home directory:
HADOOP_HOME=/home/user/applications/hadoop-2.6.0-cdh5.7.0
- Set environment variables:
export HIVE_HOME=/home/user/applications/hive-1.1.0-cdh5.7.0
export PATH=$HIVE_HOME/bin:$PATH
Hive Operations
Basic HQL Operations
Create database:
CREATE DATABASE analytics;
List databases:
SHOW DATABASES;
Create table with custom delimiter:
CREATE TABLE student_records(class_id STRING, student_id STRING, grade INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Load data from local file:
LOAD DATA LOCAL INPATH '/home/user/data/students.csv'
OVERWRITE INTO TABLE student_records;
Query data:
SELECT * FROM student_records;
Aggregation with grouping:
SELECT class_id, COUNT(grade)
FROM student_records
WHERE grade >= 60
GROUP BY class_id;
Managed vs External Tables
| Aspect | Managed Table | External Table |
|---|---|---|
| Creation | No EXTERNAL keyword | EXTERNAL keyword required |
| Data Management | Hive-managed | HDFS-managed |
| Storage Location | Warehouse directory | Custom HDFS path |
| Deletion Impact | Metadata and data removed | Only metadata removed |
| Schema Changes | Direct metadata sync | Requires table repair |
External table creation:
CREATE EXTERNAL TABLE external_students (class_id STRING, student_id STRING, grade INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/hdfs/path/students/';
Partitioned Tables
Partitioning enhances query performance by organizing data into logical segments.
Create partitioned table:
CREATE TABLE employee_data (emp_name STRING, emp_salary BIGINT)
PARTITIONED BY (hire_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
View partitions:
SHOW PARTITIONS employee_data;
Add partition:
ALTER TABLE employee_data ADD IF NOT EXISTS PARTITION(hire_date='2023-12-01');
Load data into partition:
LOAD DATA LOCAL INPATH '/home/user/data/employees.csv'
INTO TABLE employee_data PARTITION(hire_date='2023-12-01');
Dynamic Partitioning
Enable automatic partition creation during data insertion.
Create target table:
CREATE TABLE employee_dynamic (emp_name STRING, emp_salary BIGINT)
PARTITIONED BY (hire_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
Enable dynamic partitioning:
SET hive.exec.dynamic.partition.mode=nonstrict;
Insert with dynamic partitions:
INSERT INTO TABLE employee_dynamic PARTITION(hire_date)
SELECT emp_name, emp_salary, hire_date FROM employee_data;