Hive Fundamentals and Core Concepts

Hive Introduction

What is Hive?

Hive is an open-source data warehouse solution originally developed by Facebook that operates on Hadoop infrastructure
It provides SQL-like query capabilities (HQL) for structured data stored in HDFS
Core functionality involves translating SQL queries into MapReduce jobs
Primary use case: batch data analytics with higher development efficiency than raw MapReduce programming

Hive Architecture

System Components

User Interfaces: Multiple access methods including:
- CLI (Command Line Interface)
- JDBC/ODBC connectors for Java applications
- Web-based graphical interface
- HiveServer2 supporting Thrift protocol for remote client connections in various programming languages
Metadata Storage: Utilizes relational databases (MySQL, Derby) for storing:
- Table definitions and schemas
- Column specifications
- Partition information
- Table properties (external/internal designation)
- Data location mappings
Query Processing: Comprehensive pipeline including:
- Parser, compiler, optimizer, and executor components
- Query plan generation stored in HDFS
- Execution via MapReduce framework

Hive and Hadoop Integration

Hive leverages HDFS for data storage and MapReduce for query processing. As a data warehouse tool rather than a cluster service, Hive installation is typically confined to Hadoop master nodes.

Hive vs Traditional Databases

Hive specializes in large-scale batch data processing with key distinctions:

Feature	Hive	RDBMS
SQL Compliance	Partial	Full ANSI SQL
Data Updates	INSERT OVERWRITE/INTO	UPDATE/INSERT/DELETE
Transactions	Not supported (default)	Supported
Schema Approach	Read-time validation	Write-time validation
Query Language	HQL	SQL
Storage	HDFS	Local/RAW devices
Execution Engine	MapReduce	Database Engine
Latency	High	Low
Subqueries	FROM clause only	Full support
Data Volume	Petabyte-scale	Smaller datasets
Scalability	High	Limited
Indexing	Bitmap indexes (v0.8+)	Complex indexing

Data Types

Primitive Types: TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP, DECIMAL, CHAR, VARCHAR, DATE
Complex Types: ARRAY, MAP, STRUCT

Table Types

Managed Tables (Internal Tables)
External Tables

Hive Data Model

All data resides in HDFS without predefined storage format
Schema-on-read approach with custom delimiters
Data model hierarchy:
- Database: HDFS directory under hive.metastore.warehouse.dir
- Table: Subdirectory within database
- External Table: Custom HDFS path location
- Partition: Table subdirectory for data segmentation
- Bucket: Hash-distributed files within table directory

Hive Installation

Prerequisites include JDK and Hadoop installation with proper environment configuraton.

Installation Steps

Download and extract Hive distribution:

tar -xzf hive-1.1.0-cdh5.7.0.tar.gz -C ~/applications/

Configure environment file:

cd hive-1.1.0-cdh5.7.0/conf/
cp hive-env.sh.template hive-env.sh
vi hive-env.sh

Add Hadoop home directory:

HADOOP_HOME=/home/user/applications/hadoop-2.6.0-cdh5.7.0

Set environment variables:

export HIVE_HOME=/home/user/applications/hive-1.1.0-cdh5.7.0
export PATH=$HIVE_HOME/bin:$PATH

Hive Operations

Basic HQL Operations

Create database:

CREATE DATABASE analytics;

List databases:

SHOW DATABASES;

Create table with custom delimiter:

CREATE TABLE student_records(class_id STRING, student_id STRING, grade INT) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Load data from local file:

LOAD DATA LOCAL INPATH '/home/user/data/students.csv' 
OVERWRITE INTO TABLE student_records;

Query data:

SELECT * FROM student_records;

Aggregation with grouping:

SELECT class_id, COUNT(grade) 
FROM student_records 
WHERE grade >= 60 
GROUP BY class_id;

Managed vs External Tables

Aspect	Managed Table	External Table
Creation	No EXTERNAL keyword	EXTERNAL keyword required
Data Management	Hive-managed	HDFS-managed
Storage Location	Warehouse directory	Custom HDFS path
Deletion Impact	Metadata and data removed	Only metadata removed
Schema Changes	Direct metadata sync	Requires table repair

External table creation:

CREATE EXTERNAL TABLE external_students (class_id STRING, student_id STRING, grade INT) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
LOCATION '/hdfs/path/students/';

Partitioned Tables

Partitioning enhances query performance by organizing data into logical segments.

Create partitioned table:

CREATE TABLE employee_data (emp_name STRING, emp_salary BIGINT) 
PARTITIONED BY (hire_date STRING) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
LINES TERMINATED BY '\n' 
STORED AS TEXTFILE;

View partitions:

SHOW PARTITIONS employee_data;

Add partition:

ALTER TABLE employee_data ADD IF NOT EXISTS PARTITION(hire_date='2023-12-01');

Load data into partition:

LOAD DATA LOCAL INPATH '/home/user/data/employees.csv' 
INTO TABLE employee_data PARTITION(hire_date='2023-12-01');

Dynamic Partitioning

Enable automatic partition creation during data insertion.

Create target table:

CREATE TABLE employee_dynamic (emp_name STRING, emp_salary BIGINT) 
PARTITIONED BY (hire_date STRING) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
LINES TERMINATED BY '\n' 
STORED AS TEXTFILE;

Enable dynamic partitioning:

SET hive.exec.dynamic.partition.mode=nonstrict;

Insert with dynamic partitions:

INSERT INTO TABLE employee_dynamic PARTITION(hire_date) 
SELECT emp_name, emp_salary, hire_date FROM employee_data;

Tags: Hive Data Warehouse Hadoop hql Big Data

Posted on Mon, 15 Jun 2026 18:24:52 +0000 by bobbfwed

Freaks City