Big Data Fundamentals and Core Technologies Overview

HDFS File System Commands

Disk Usage Information

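Show the capacity, used space, and free space of the file system that holds a path (use -du instead to get the size of individual files and directories):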

hadoop fs -df /home/myfile

Merge Files

Combine multiple files from HDFS into a single local file:

hadoop fs -getmerge /user/hduser0011/test /home/myfile/dir

Write Output to HDFS

Write standard input to an HDFS file (the - tells -put to read from stdin):

echo abc | hadoop fs -put - /path/to/file

MapReduce Architecture

Functional Components

The MapReduce framework performs two primary operations:

Map: Applies a function to each element in a dataset.
Reduce: Aggregates results from parallel processes.
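
The two operations can be illustrated outside Hadoop with Python's built-in map and functools.reduce. This is only a single-process sketch of the idea; the sample list is made up for illustration.

```python
from functools import reduce

# Hypothetical sample data for illustration.
words = ["hadoop", "hive", "kafka"]

# Map: apply a function (here, len) to every element independently.
lengths = list(map(len, words))

# Reduce: fold the mapped results into a single aggregate value.
total = reduce(lambda acc, n: acc + n, lengths, 0)

print(lengths, total)
```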

System Components

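In classic MapReduce (MRv1, Hadoop 1.x), the architecture consists of four core components. From Hadoop 2 onward, YARN replaces the JobTracker and TaskTracker with the ResourceManager, per-job ApplicationMasters, and NodeManagers.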

Client: Initiates jobs and manages communication.
JobTracker: Coordinates job execution and resource allocation.
TaskTracker: Executes tasks assigned by the JobTracker.
HDFS: Stores data blocks across the cluster.

Scheduling Mechanisms

Several scheduling policies are available:

FIFO Scheduler
Fair Scheduler
Capacity Scheduler

WordCount Workflow

The typical processing flow includes:

Input data is split into chunks.
Each chunk is processed by a Map task.
Intermediate key-value pairs are generated.
Shuffle phase redistributes data based on keys.
Reduce tasks aggregate values associated with identical keys.
Final output is produced.
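
The flow above can be sketched in a single Python process. This is a teaching model under simplifying assumptions, not Hadoop code: the function names (split_input, map_task, shuffle, reduce_task, word_count) and the chunk size are hypothetical, and real input splits are based on block size rather than line counts.

```python
from collections import defaultdict


def split_input(text, chunk_size=2):
    """Split the input into chunks of `chunk_size` lines (stand-in for input splits)."""
    lines = text.splitlines()
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_task(chunk):
    """Emit an intermediate (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the shuffle phase does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Aggregate all values that share the same key."""
    return key, sum(values)


def word_count(text):
    intermediate = []
    for chunk in split_input(text):
        intermediate.extend(map_task(chunk))
    grouped = shuffle(intermediate)
    return dict(reduce_task(k, v) for k, v in sorted(grouped.items()))
```

Calling word_count("a b a\nb c\na") walks through every step of the list: two chunks, six intermediate pairs, three keys after the shuffle, and one count per key.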

Hive Data Warehouse

Metadata Management

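Metadata describes data characteristics such as types, structures, and relationships. The Hive metastore keeps it in a relational database (embedded Derby by default, often MySQL in shared setups) rather than in HDFS, because metadata needs fast, small, random reads and updates, which HDFS is not designed for.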

Complex Data Types

Array: Ordered collection of elements of the same type.
Struct: Composite type containing heterogeneous fields.
Map: Key-value pair storage mechanism.
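
As a rough analogy, the three types behave like a Python list, dataclass, and dict. The sketch below is not Hive code; the Employee and Address classes and their fields are hypothetical, and the comments show the matching HiveQL access syntax.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Address:                 # plays the role of STRUCT<city:STRING, zip:INT>
    city: str
    zip: int


@dataclass
class Employee:                # hypothetical row layout
    name: str
    skills: List[str]          # ARRAY<STRING>: ordered, same element type
    address: Address           # STRUCT: named fields of different types
    scores: Dict[str, int]     # MAP<STRING, INT>: key-value pairs


row = Employee("Alice", ["sql", "java"], Address("Berlin", 10115), {"math": 95})

first_skill = row.skills[0]       # HiveQL: skills[0]
city = row.address.city           # HiveQL: address.city
math_score = row.scores["math"]   # HiveQL: scores['math']
```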

Common Operations

Initialize Hive Session

start-all.sh
hive

Table Manipulation

Create a new table:

CREATE TABLE table_name (column1 datatype, column2 datatype);

Display table schema:

desc table_name;

Load data from a local file:

LOAD DATA LOCAL INPATH '/path/to/l.txt' INTO TABLE table_name;

Add a column:

ALTER TABLE table_name ADD COLUMNS (new_column datatype);

Modify column definition:

ALTER TABLE table_name CHANGE old_column new_column datatype;

Remove a column (REPLACE COLUMNS redefines the column list, so list only the columns to keep):

ALTER TABLE table_name REPLACE COLUMNS (col1 datatype, col2 datatype);

Duplicate table structure (without copying data):

CREATE TABLE new_table LIKE existing_table;

Rename a table:

ALTER TABLE table_name RENAME TO new_table_name;

Clear all records:

TRUNCATE TABLE table_name;

Drop a table:

DROP TABLE table_name;

Export data to local system:

INSERT OVERWRITE LOCAL DIRECTORY '/local/path' SELECT * FROM table_name;

Storage Location

By default, managed tables are stored in HDFS under /user/hive/warehouse (configurable through hive.metastore.warehouse.dir).

External vs Internal Tables

Dropping an internal (managed) table deletes both its metadata and its data files. Dropping an external table removes only the metadata; the data files stay where they are.

Create an external table:

CREATE EXTERNAL TABLE table_name (...);

Partitioned Tables

Enable dynamic partitioning:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=1000;
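
With dynamic partitioning, the partition a row belongs to is taken from the row's own column value, and each distinct value gets its own column=value directory under the table. The following sketch models only that routing. The function name dynamic_partition, the sample table path, and the error type are assumptions for illustration; max_partitions loosely mirrors the per-node limit set above.

```python
from collections import defaultdict


def dynamic_partition(rows, partition_col, table_dir, max_partitions=1000):
    """Group rows by the value of `partition_col`, one directory per distinct value."""
    partitions = defaultdict(list)
    for row in rows:
        path = f"{table_dir}/{partition_col}={row[partition_col]}"
        if path not in partitions and len(partitions) >= max_partitions:
            raise RuntimeError("too many dynamic partitions")
        # The partition value lives in the directory name, not in the data file.
        partitions[path].append({k: v for k, v in row.items() if k != partition_col})
    return dict(partitions)


# Hypothetical sample rows and table location.
rows = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "DE"},
    {"id": 3, "country": "US"},
]
layout = dynamic_partition(rows, "country", "/user/hive/warehouse/sales")
```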

Bucketed Tables

Rows are assigned to a fixed number of buckets by hashing a chosen column. This speeds up sampling and joins on that column.
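
The assignment rule is hash-modulo: bucket = hash(column value) mod number of buckets. The sketch below uses the integer key itself as the hash, which is a simplification; the hash Hive actually applies depends on the column type and Hive version. The names bucket_for and bucketize are hypothetical.

```python
def bucket_for(key: int, num_buckets: int) -> int:
    """Assign an integer key to a bucket using hash-modulo (simplified stand-in)."""
    return key % num_buckets


def bucketize(rows, key_col, num_buckets):
    """Distribute rows into `num_buckets` lists based on the bucketing column."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[key_col], num_buckets)].append(row)
    return buckets


# Hypothetical sample: six rows spread over four buckets.
rows = [{"id": i} for i in range(1, 7)]
buckets = bucketize(rows, "id", 4)
```

Because equal keys always land in the same bucket, a join on the bucketing column only needs to compare matching buckets instead of scanning both tables in full.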

Apache ZooKeeper

Core Concepts

ZooKeeper provides coordination services for distributed systems. It exposes a hierarchical, filesystem-like namespace of nodes (znodes) and notifies clients of changes through watches.

Node Types

Persistent Nodes: Remain until explicitly deleted.
Persistent Sequential Nodes: Persistent nodes whose names get an automatically incremented sequence number appended at creation.
Ephemeral Nodes: Removed when the session ends.
Ephemeral Sequential Nodes: Combine ephemeral nature with automatic numbering.
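
The four creation modes can be modeled with a small in-memory store. This is a toy sketch, not a ZooKeeper client: ToyZNodeStore and its methods are hypothetical names, and the session is represented by a plain string. The 10-digit zero-padded suffix and the per-parent counter follow ZooKeeper's documented behavior for sequential nodes.

```python
class ToyZNodeStore:
    """In-memory toy model of znode creation modes; not a ZooKeeper client."""

    def __init__(self):
        self.nodes = {}      # path -> owning session id (None for persistent nodes)
        self.counters = {}   # parent path -> next sequence number

    def create(self, path, session=None, ephemeral=False, sequential=False):
        if ephemeral and session is None:
            raise ValueError("an ephemeral node needs an owning session")
        if sequential:
            parent = path.rsplit("/", 1)[0]
            seq = self.counters.get(parent, 0)
            self.counters[parent] = seq + 1
            path = f"{path}{seq:010d}"   # append a 10-digit zero-padded counter
        self.nodes[path] = session if ephemeral else None
        return path

    def close_session(self, session):
        """Remove every ephemeral node owned by the session; keep everything else."""
        self.nodes = {
            p: s for p, s in self.nodes.items() if s is None or s != session
        }
```

Creating two ephemeral sequential nodes under the same parent and then closing one session leaves only the other node behind, which is the basis of common lock and leader-election recipes.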

Command Set

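Use netcat to send four-letter-word commands to the ZooKeeper server (ZooKeeper 3.5.3 and later only answer commands listed in 4lw.commands.whitelist):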

Check server status:

echo stat | nc host port

Verify service availability:

echo ruok | nc host port

List outstanding sessions and ephemeral nodes (works on the leader only):

echo dump | nc host port

Show configuration:

echo conf | nc host port

Display connection info:

echo cons | nc host port

View environment details:

echo envi | nc host port

Monitor health metrics:

echo mntr | nc host port
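
The same commands can be sent without netcat by opening a TCP connection, writing the four letters, and reading until the server closes the connection. A minimal sketch using only the standard library; the function name four_letter_word and the timeout default are assumptions.

```python
import socket


def four_letter_word(host: str, port: int, command: str, timeout: float = 5.0) -> str:
    """Send a four-letter command and return the server's full reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:   # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

For example, four_letter_word("master", 2181, "ruok") should return "imok" from a healthy server, assuming the host and port match your deployment and the command is whitelisted.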

Apache Kafka

Basic Concepts

Kafka is a high-throughput distributed messaging platform:

Producers: Send messages to topics.
Consumers: Read messages from topics.
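Brokers: Kafka server processes that store messages and serve them to clients.
Topics: Named categories that messages are published to.
Partitions: Ordered, append-only logs that subdivide a topic so it can be spread across brokers.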
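
How these pieces relate can be sketched with a toy in-memory topic. This is not a Kafka client: ToyTopic, send, and read are hypothetical names, and zlib.crc32 is a stand-in for the hash that Kafka's default partitioner applies to the message key.

```python
import zlib


class ToyTopic:
    """In-memory toy model of a topic with partitions; not a Kafka client."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key: str, value: str):
        """Producer side: choose a partition from the key, append, return (partition, offset)."""
        # Stand-in hash; Kafka's default partitioner uses a different hash function.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(value)
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition: int, offset: int = 0):
        """Consumer side: read messages from one partition starting at an offset."""
        return self.partitions[partition][offset:]
```

Messages with the same key always land in the same partition, so their relative order is preserved there, while different keys can be processed in parallel across partitions.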

Key Features

Persistent storage
Scalable architecture
Supports batch and real-time processing
Compression support (snappy, gzip)

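Essential Commands

The commands below use the ZooKeeper-based options of older Kafka releases. Newer releases replace --zookeeper and --broker-list with --bootstrap-server pointing at a broker (for example --bootstrap-server master:9092).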

Create a topic:

bin/kafka-topics.sh --create --zookeeper master:2181 --replication-factor 1 --partitions 1 --topic test

Start producer:

bin/kafka-console-producer.sh --broker-list master:9092 --topic test

Start consumer:

bin/kafka-console-consumer.sh --zookeeper master:2181 --topic test --from-beginning

List topics:

bin/kafka-topics.sh --list --zookeeper master:2181

Describe topic:

bin/kafka-topics.sh --describe --zookeeper master:2181 --topic test

Apache HBase

Characteristics

HBase is a distributed, scalable, non-relational database built on top of HDFS with column-oriented storage.

Differences from Relational DBs

Column-family-oriented storage instead of row-based
No complex joins
Single index (row key)
Versioned data model
Horizontal scalability
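
The row-key index and the versioned data model can be pictured as a nested map: row key, then column (family:qualifier), then timestamped versions. The sketch below is a toy model, not the HBase API; ToyHBaseTable and its max_versions parameter are hypothetical.

```python
import time


class ToyHBaseTable:
    """Toy model: row key -> 'family:qualifier' -> list of (timestamp, value), newest first."""

    def __init__(self, families, max_versions=3):
        self.families = set(families)
        self.max_versions = max_versions
        self.rows = {}

    def put(self, row_key, column, value, timestamp=None):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        ts = timestamp if timestamp is not None else time.time_ns()
        versions = self.rows.setdefault(row_key, {}).setdefault(column, [])
        versions.append((ts, value))
        # Keep the newest versions first and drop those beyond the limit.
        versions.sort(key=lambda cell: cell[0], reverse=True)
        del versions[self.max_versions:]

    def get(self, row_key, column):
        """Return the newest value for a cell, or None if it does not exist."""
        versions = self.rows.get(row_key, {}).get(column)
        return versions[0][1] if versions else None
```

A second put to the same cell does not overwrite the first; it adds a newer version, and get returns the latest one. Column families must be declared up front, while qualifiers within a family can be added freely.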

Shell Operations

Start HBase shell:

hbase shell

Create a table:

create 'students', 'info', 'grades'

List tables:

list

Describe table structure:

desc 'students'

Scan entire table:

scan 'students'

Count rows:

count 'students'

Insert data:

put 'students', 'row1', 'info:name', 'Alice'
put 'students', 'row1', 'grades:math', '95'

Retrieve specific row:

get 'students', 'row1'

Update existing value:

put 'students', 'row1', 'grades:math', '98'

Delete a cell (one column of one row):

delete 'students', 'row1', 'grades:math'

Add column family:

alter 'students', NAME => 'contact'

Disable table:

disable 'students'

Enable table:

enable 'students'

Check whether a table exists:

exists 'students'

Verify enabled/disabled state:

is_enabled 'students'
is_disabled 'students'

Tags: bigdata Hadoop mapreduce Hive ZooKeeper

Posted on Thu, 11 Jun 2026 18:46:45 +0000 by brucensal
