HDFS File System Commands
Disk Usage Information
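Show the capacity, used space, and free space of the file system that holds a given path (use -du instead to get the size of the files under that path):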
hadoop fs -df /home/myfile
Merge Files
Combine multiple files from HDFS into a single local file:
hadoop fs -getmerge /user/hduser0011/test /home/myfile/dir
Write Output to HDFS
Write standard input to an HDFS file (the - argument tells -put to read from stdin):
echo abc | hadoop fs -put - /path/to/file
MapReduce Architecture
Functional Components
The MapReduce framework performs two primary operations:
- Map: Applies a function to each element in a dataset.
- Reduce: Aggregates results from parallel processes.
System Components
The architecture consists of four core components:
- Client: Initiates jobs and manages communication.
- JobTracker: Coordinates job execution and resource allocation.
- TaskTracker: Executes tasks assigned by the JobTracker.
- HDFS: Stores data blocks across the cluster.
Scheduling Mechanisms
Several scheduling policies are available:
- FIFO Scheduler
- Fair Scheduler
- Capacity Scheduler
WordCount Workflow
The typical processing flow includes:
- Input data is split into chunks.
- Each chunk is processed by a Map task.
- Intermediate key-value pairs are generated.
- Shuffle phase redistributes data based on keys.
- Reduce tasks aggregate values associated with identical keys.
- Final output is produced.
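The steps above can be traced in a few lines of plain Python. This is only a single-process sketch of the data flow, not Hadoop code: the function names (split_input, map_chunk, shuffle, reduce_counts) and the two-lines-per-chunk split size are made up for illustration.

```python
from collections import defaultdict


def split_input(text, lines_per_chunk=2):
    """Split the input into small chunks of lines (stand-in for input splits)."""
    lines = text.splitlines()
    return [lines[i:i + lines_per_chunk] for i in range(0, len(lines), lines_per_chunk)]


def map_chunk(chunk):
    """Map task: emit an intermediate (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(pairs):
    """Shuffle: collect all values that share a key into one list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce task: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(text):
    """Run split -> map -> shuffle -> reduce in one process."""
    pairs = []
    for chunk in split_input(text):
        pairs.extend(map_chunk(chunk))
    return reduce_counts(shuffle(pairs))


if __name__ == "__main__":
    print(word_count("hello hadoop\nhello hive\nhadoop hdfs"))
```

In a real job each chunk would be mapped on a different node and the shuffle would move data across the network; here everything runs sequentially so the flow is easy to follow.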
Hive Data Warehouse
Metadata Management
Metadata describes data characteristics such as types, structures, and relationships. Hive keeps it in a relational database (the metastore) rather than in HDFS, because metadata is read and updated in small, frequent operations that HDFS handles poorly.
Complex Data Types
- Array: Ordered collection of elements of the same type.
- Struct: Composite type containing heterogeneous fields.
- Map: Key-value pair storage mechanism.
Common Operations
Initialize Hive Session
start-all.sh
hive
Table Manipulation
Create a new table:
CREATE TABLE table_name (column1 datatype, column2 datatype);
Display table schema:
desc table_name;
Load data from a local file:
LOAD DATA LOCAL INPATH '/path/to/l.txt' INTO TABLE table_name;
Add a column:
ALTER TABLE table_name ADD COLUMNS (new_column datatype);
Modify column definition:
ALTER TABLE table_name CHANGE old_column new_column datatype;
Remove a column (REPLACE COLUMNS redefines the whole column list, so list only the columns to keep):
ALTER TABLE table_name REPLACE COLUMNS (col1 datatype, col2 datatype);
Duplicate table structure:
CREATE TABLE new_table AS SELECT * FROM existing_table WHERE 1=0;
Rename a table:
ALTER TABLE table_name RENAME TO new_table_name;
Clear all records:
TRUNCATE TABLE table_name;
Drop a table:
DROP TABLE table_name;
Export data to local system:
INSERT OVERWRITE LOCAL DIRECTORY '/local/path' SELECT * FROM table_name;
Storage Location
By default, table data is physically stored in HDFS under: /user/hive/warehouse
External vs Internal Tables
Dropping an internal (managed) table deletes both its metadata and its physical files. Dropping an external table removes only the metadata and leaves the physical files in place.
Create an external table:
CREATE EXTERNAL TABLE table_name (...);
Partitioned Tables
Enable dynamic partitioning:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=1000;
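With dynamic partitioning, the partition a row lands in is taken from the row's own data rather than from a value written in the statement. The Python sketch below imitates that routing under some assumptions: the table name logs and column dt are hypothetical, the col=value directory naming follows Hive's usual layout, and max_partitions merely mimics the limit set above.

```python
from collections import defaultdict

WAREHOUSE = "/user/hive/warehouse"  # default warehouse location


def partition_path(table, partition_col, value):
    """Build the directory a partition would live in (col=value layout)."""
    return f"{WAREHOUSE}/{table}/{partition_col}={value}"


def dynamic_partition(table, rows, partition_col, max_partitions=1000):
    """Route each row to a partition directory derived from its own column value."""
    partitions = defaultdict(list)
    for row in rows:
        remaining = dict(row)                      # copy so the input is not mutated
        value = remaining.pop(partition_col)       # partition column is not stored in the file
        partitions[partition_path(table, partition_col, value)].append(remaining)
        if len(partitions) > max_partitions:       # illustrative stand-in for the pernode limit
            raise ValueError("too many dynamic partitions")
    return dict(partitions)


if __name__ == "__main__":
    rows = [
        {"id": 1, "dt": "2024-01-01"},
        {"id": 2, "dt": "2024-01-02"},
        {"id": 3, "dt": "2024-01-01"},
    ]
    for path, contents in dynamic_partition("logs", rows, "dt").items():
        print(path, contents)
```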
Bucketed Tables
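Rows are hashed on a chosen column into a fixed number of buckets, each stored as a separate file. This improves query performance for sampling and for joins on the bucketing column.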
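A rough Python sketch of the bucketing idea follows. It assumes an integer bucketing column and uses the value itself as the hash; the hash Hive actually applies depends on the column type, so treat bucket_for as a simplified stand-in. The names bucket_for and assign_buckets are hypothetical.

```python
def bucket_for(value, num_buckets):
    """Pick a bucket for an integer key.

    Assumption: the key's hash is the integer itself, masked to be non-negative.
    """
    return (value & 0x7FFFFFFF) % num_buckets


def assign_buckets(rows, key, num_buckets):
    """Distribute rows into num_buckets lists based on the bucketing column."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[key], num_buckets)].append(row)
    return buckets


if __name__ == "__main__":
    rows = [{"id": i} for i in range(1, 7)]
    for number, bucket in enumerate(assign_buckets(rows, "id", 3)):
        print(number, bucket)
```

Because equal keys always land in the same bucket, two tables bucketed the same way on the join column can be joined bucket by bucket instead of scanning everything.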
Apache ZooKeeper
Core Concepts
ZooKeeper provides coordination services for distributed systems. It exposes a hierarchical namespace of nodes (znodes), similar to a filesystem, and notifies clients when nodes they watch change.
Node Types
- Persistent Nodes: Remain until explicitly deleted.
- Persistent Sequential Nodes: Persistent nodes whose names get an automatically incremented sequence number appended on creation.
- Ephemeral Nodes: Removed when the session ends.
- Ephemeral Sequential Nodes: Combine ephemeral nature with automatic numbering.
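The four node types differ along two independent switches: ephemeral or persistent, and sequential or not. The in-memory toy below shows both behaviours. It is not a ZooKeeper client; ToyZooKeeper and its methods are invented for this sketch, and it skips details such as requiring parent nodes to exist. The zero-padded 10-digit suffix mirrors how ZooKeeper names sequential nodes.

```python
class ToyZooKeeper:
    """In-memory imitation of znode creation and session cleanup."""

    def __init__(self):
        self.nodes = {}     # path -> (data, owning session or None for persistent nodes)
        self.counters = {}  # parent path -> next sequence number

    def create(self, path, data=b"", session=None, ephemeral=False, sequential=False):
        """Create a node and return its final path."""
        if ephemeral and session is None:
            raise ValueError("ephemeral nodes need an owning session")
        if sequential:
            parent = path.rsplit("/", 1)[0]
            seq = self.counters.get(parent, 0)
            self.counters[parent] = seq + 1
            path = f"{path}{seq:010d}"  # append a zero-padded counter to the name
        if path in self.nodes:
            raise KeyError(f"node already exists: {path}")
        self.nodes[path] = (data, session if ephemeral else None)
        return path

    def close_session(self, session):
        """Remove every ephemeral node owned by the session; persistent nodes stay."""
        self.nodes = {
            path: entry for path, entry in self.nodes.items()
            if entry[1] is None or entry[1] != session
        }


if __name__ == "__main__":
    zk = ToyZooKeeper()
    zk.create("/config", b"settings")
    print(zk.create("/locks/lock-", session="s1", ephemeral=True, sequential=True))
    print(zk.create("/locks/lock-", session="s2", ephemeral=True, sequential=True))
    zk.close_session("s1")
    print(sorted(zk.nodes))
```

Ephemeral sequential nodes are what lock and leader-election recipes typically build on: the lowest sequence number wins, and a crashed holder's node disappears with its session.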
Command Set
Use netcat to send four-letter commands to a ZooKeeper server:
Check server status:
echo stat | nc host port
Verify service availability:
echo ruok | nc host port
List outstanding sessions and their ephemeral nodes:
echo dump | nc host port
Show configuration:
echo conf | nc host port
Display connection info:
echo cons | nc host port
View environment details:
echo envi | nc host port
Monitor health metrics:
echo mntr | nc host port
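The same commands can be sent without netcat. The sketch below does what echo ... | nc does: open a TCP connection, write the four-letter command, and read until the server closes the connection. The function name four_letter_word and the 5-second timeout are choices made for this example. Recent ZooKeeper releases only answer commands that are enabled in the server's four-letter-word whitelist, so an empty reply may simply mean the command is not enabled.

```python
import socket


def four_letter_word(host, port, command, timeout=5.0):
    """Send a four-letter command (for example 'ruok') and return the reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection once it has replied
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

For instance, four_letter_word("master", 2181, "ruok") should return imok when the server on that host is healthy; the host and port here follow the master:2181 address used elsewhere in these notes.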
Apache Kafka
Basic Concepts
Kafka is a high-throughput distributed messaging platform built around these concepts:
- Producers: Send messages to topics.
- Consumers: Read messages from topics.
- Brokers: Kafka servers.
- Topics: Message categories.
- Partitions: Physical divisions within topics; each partition is an ordered, append-only log stored on a broker.
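To make the relationship between topics, partitions, and offsets concrete, here is a toy in-memory topic. It is not the Kafka client API: ToyTopic, send, and read are hypothetical names, and CRC32 is only a stand-in for the hash that Kafka's default partitioner applies to the message key.

```python
import zlib


class ToyTopic:
    """A topic modelled as a list of append-only partition logs."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        """Producer side: choose a partition from the key and append the message."""
        # Assumption: CRC32 replaces the real partitioner's hash function.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append((key, value))
        return partition, len(log) - 1  # (partition, offset of the new message)

    def read(self, partition, offset):
        """Consumer side: return all messages in a partition starting at an offset."""
        return self.partitions[partition][offset:]


if __name__ == "__main__":
    topic = ToyTopic("test", num_partitions=3)
    partition, _ = topic.send("user-1", "login")
    topic.send("user-1", "logout")
    print(topic.read(partition, 0))
```

Messages with the same key always go to the same partition, which is why ordering is guaranteed per partition rather than across the whole topic.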
Key Features
- Persistent storage
- Scalable architecture
- Supports batch and real-time processing
- Compression support (snappy, gzip)
Essential Commands
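Create a topic (the --zookeeper option used in these commands applies to older Kafka releases; newer releases take --bootstrap-server with a broker address instead):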
bin/kafka-topics.sh --create --zookeeper master:2181 --replication-factor 1 --partitions 1 --topic test
Start producer:
bin/kafka-console-producer.sh --broker-list master:9092 --topic test
Start consumer:
bin/kafka-console-consumer.sh --zookeeper master:2181 --topic test --from-beginning
List topics:
bin/kafka-topics.sh --list --zookeeper master:2181
Describe topic:
bin/kafka-topics.sh --describe --zookeeper master:2181 --topic test
Apache HBase
Characteristics
HBase is a distributed, scalable, non-relational database built on top of HDFS with column-oriented storage.
Differences from Relational DBs
- Column-based storage (columns are grouped into column families) instead of row-based
- No complex joins
- Single index (row key)
- Versioned data model
- Horizontal scalability
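The versioned data model is the least familiar of these points, so here is a small Python sketch of it. Each cell is addressed by row key and family:qualifier and holds a list of timestamped values; writing again adds a version instead of overwriting. ToyTable is an invented class, the timestamps are a simple counter, and unlike real HBase it keeps every version rather than a configured maximum.

```python
import itertools


class ToyTable:
    """In-memory imitation of HBase's row -> column -> versions layout."""

    def __init__(self, *families):
        self.families = set(families)
        self.rows = {}                    # row key -> {column -> [(timestamp, value), ...]}
        self._clock = itertools.count(1)  # logical timestamps for the sketch

    def put(self, row, column, value):
        """Add a new version of a cell; the column family must already exist."""
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        versions = self.rows.setdefault(row, {}).setdefault(column, [])
        versions.append((next(self._clock), value))

    def get(self, row, column, versions=1):
        """Return up to `versions` values of a cell, newest first."""
        cells = self.rows.get(row, {}).get(column, [])
        return [value for _, value in reversed(cells)][:versions]


if __name__ == "__main__":
    table = ToyTable("info", "grades")
    table.put("row1", "grades:math", "95")
    table.put("row1", "grades:math", "98")
    print(table.get("row1", "grades:math"))
    print(table.get("row1", "grades:math", versions=2))
```

The row key is the only lookup path in this model, which mirrors the single-index point above: anything that is not a row-key lookup becomes a scan.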
Shell Operations
Start HBase shell:
hbase shell
Create a table:
create 'students', 'info', 'grades'
List tables:
list
Describe table structure:
desc 'students'
Scan entire table:
scan 'students'
Count rows:
count 'students'
Insert data:
put 'students', 'row1', 'info:name', 'Alice'
put 'students', 'row1', 'grades:math', '95'
Retrieve specific row:
get 'students', 'row1'
Update existing value:
put 'students', 'row1', 'grades:math', '98'
Delete a cell (one column value of a row):
delete 'students', 'row1', 'grades:math'
Add column family:
alter 'students', 'contact'
Disable table:
disable 'students'
Enable table:
enable 'students'
Check whether a table exists:
exists 'students'
Verify enabled/disabled state:
is_enabled 'students'
is_disabled 'students'