Hive Fundamentals and Core Concepts

Hive Introduction

What is Hive?

  • Hive is an open-source data warehouse solution originally developed by Facebook that operates on Hadoop infrastructure
  • It provides SQL-like query capabilities (HQL) for structured data stored in HDFS
  • Core functionality involves translating SQL queries into MapReduce jobs
  • Primary use case: batch data analytics with higher development efficiency than raw MapReduce programming

Hive Architecture

System Components

  • User Interfaces: Multiple access methods including:

    • CLI (Command Line Interface)
    • JDBC/ODBC connectors for Java applications
    • Web-based graphical interface
    • HiveServer2 supporting Thrift protocol for remote client connections in various programming languages
  • Metadata Storage: Utilizes relational databases (MySQL, Derby) for storing:

    • Table definitions and schemas
    • Column specifications
    • Partition information
    • Table properties (external/internal designation)
    • Data location mappings
  • Query Processing: Comprehensive pipeline including:

    • Parser, compiler, optimizer, and executor components
    • Query plan generation stored in HDFS
    • Execution via MapReduce framework

Hive and Hadoop Integration

Hive leverages HDFS for data storage and MapReduce for query processing. As a data warehouse tool rather than a cluster service, Hive installation is typically confined to Hadoop master nodes.

Hive vs Traditional Databases

Hive specializes in large-scale batch data processing with key distinctions:

Feature Hive RDBMS
SQL Compliance Partial Full ANSI SQL
Data Updates INSERT OVERWRITE/INTO UPDATE/INSERT/DELETE
Transactions Not supported (default) Supported
Schema Approach Read-time validation Write-time validation
Query Language HQL SQL
Storage HDFS Local/RAW devices
Execution Engine MapReduce Database Engine
Latency High Low
Subqueries FROM clause only Full support
Data Volume Petabyte-scale Smaller datasets
Scalability High Limited
Indexing Bitmap indexes (v0.8+) Complex indexing

Data Types

  • Primitive Types: TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP, DECIMAL, CHAR, VARCHAR, DATE
  • Complex Types: ARRAY, MAP, STRUCT

Table Types

  • Managed Tables (Internal Tables)
  • External Tables

Hive Data Model

  • All data resides in HDFS without predefined storage format
  • Schema-on-read approach with custom delimiters
  • Data model hierarchy:
    • Database: HDFS directory under hive.metastore.warehouse.dir
    • Table: Subdirectory within database
    • External Table: Custom HDFS path location
    • Partition: Table subdirectory for data segmentation
    • Bucket: Hash-distributed files within table directory

Hive Installation

Prerequisites include JDK and Hadoop installation with proper environment configuraton.

Installation Steps

  1. Download and extract Hive distribution:
tar -xzf hive-1.1.0-cdh5.7.0.tar.gz -C ~/applications/
  1. Configure environment file:
cd hive-1.1.0-cdh5.7.0/conf/
cp hive-env.sh.template hive-env.sh
vi hive-env.sh

Add Hadoop home directory:

HADOOP_HOME=/home/user/applications/hadoop-2.6.0-cdh5.7.0
  1. Set environment variables:
export HIVE_HOME=/home/user/applications/hive-1.1.0-cdh5.7.0
export PATH=$HIVE_HOME/bin:$PATH

Hive Operations

Basic HQL Operations

Create database:

CREATE DATABASE analytics;

List databases:

SHOW DATABASES;

Create table with custom delimiter:

CREATE TABLE student_records(class_id STRING, student_id STRING, grade INT) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Load data from local file:

LOAD DATA LOCAL INPATH '/home/user/data/students.csv' 
OVERWRITE INTO TABLE student_records;

Query data:

SELECT * FROM student_records;

Aggregation with grouping:

SELECT class_id, COUNT(grade) 
FROM student_records 
WHERE grade >= 60 
GROUP BY class_id;

Managed vs External Tables

Aspect Managed Table External Table
Creation No EXTERNAL keyword EXTERNAL keyword required
Data Management Hive-managed HDFS-managed
Storage Location Warehouse directory Custom HDFS path
Deletion Impact Metadata and data removed Only metadata removed
Schema Changes Direct metadata sync Requires table repair

External table creation:

CREATE EXTERNAL TABLE external_students (class_id STRING, student_id STRING, grade INT) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
LOCATION '/hdfs/path/students/';

Partitioned Tables

Partitioning enhances query performance by organizing data into logical segments.

Create partitioned table:

CREATE TABLE employee_data (emp_name STRING, emp_salary BIGINT) 
PARTITIONED BY (hire_date STRING) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
LINES TERMINATED BY '\n' 
STORED AS TEXTFILE;

View partitions:

SHOW PARTITIONS employee_data;

Add partition:

ALTER TABLE employee_data ADD IF NOT EXISTS PARTITION(hire_date='2023-12-01');

Load data into partition:

LOAD DATA LOCAL INPATH '/home/user/data/employees.csv' 
INTO TABLE employee_data PARTITION(hire_date='2023-12-01');

Dynamic Partitioning

Enable automatic partition creation during data insertion.

Create target table:

CREATE TABLE employee_dynamic (emp_name STRING, emp_salary BIGINT) 
PARTITIONED BY (hire_date STRING) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
LINES TERMINATED BY '\n' 
STORED AS TEXTFILE;

Enable dynamic partitioning:

SET hive.exec.dynamic.partition.mode=nonstrict;

Insert with dynamic partitions:

INSERT INTO TABLE employee_dynamic PARTITION(hire_date) 
SELECT emp_name, emp_salary, hire_date FROM employee_data;

Tags: Hive Data Warehouse Hadoop hql Big Data

Posted on Mon, 15 Jun 2026 18:24:52 +0000 by bobbfwed