Big Data Final Review Guide 2023 Beta

Chapter 3: HDFS

http://master:50070

1.1 NameNode

NameNode is the manager. It stores metadata, which is data about data.

Files in HDFS are split into data blocks of 128 MB (originally 64 MB) for storage.

Replication strategy: The default replication factor in HDFS is 3.

1.2 Secondary NameNode

1.3 DataNode

1.4 Client

1.5 File Write Process

  1. NameNode
  2. DataNode
  3. Client

1.6 File Read Process

  1. NameNode
  2. DataNode
  3. Client

Component Mapping:

  • ResourceManager <-> NameNode
  • NodeManager <-> DataNode

2. Common HDFS Commands

1. Create Directory

hadoop fs -mkdir <paths>

Create a single directory:

hadoop fs -mkdir /home/myfile/dir1

Create multiple directories:

hadoop fs -mkdir /home/myfile/dir1 /home/myfile/dir2

2. List Directory Contents

hadoop fs -ls <paths>

Example:

hadoop fs -ls /home/myfile/

List all subdirectories recursively

hadoop fs -ls -R <path> (case-sensitive)

Example:

hadoop fs -ls -R /home/myfile/

3. Upload File

Copy one or more files from the local filesystem to HDFS.

hadoop fs -put <local_files> ... <hdfs_path>

Example:

hadoop fs -put Desktop/test.sh /home/myfile/dir1/

4. Download File

Copy files from HDFS to the local filesystem.

hadoop fs -get <hdfs_paths> <local_path>

Example:

hadoop fs -get /home/myfile/test.sh Downloads/

5. View File Contents

hadoop fs -cat <paths>

Example:

hadoop fs -cat /home/myfile/test.sh

6. Copy File

hadoop fs -cp <source_path> ... <destination_path>

Example:

hadoop fs -cp /home/myfile/test.sh /home/myfile/dir

7. Move File

hadoop fs -mv <source_path> <destination_path>

Example:

hadoop fs -mv /home/myfile/test.sh /home/myfile/dir

8. Delete File

Two main options: -rm for files, -rm -r for directories.

hadoop fs -rm <path>

Example:

hadoop fs -rm /home/myfile/test.sh

The above command only removes files. To delete a directory containing files, use the -r flag.

Usage: hadoop fs -rm -r <path>

Example:

hadoop fs -rm -r /home/myfile/dir

9. View End of File

hadoop fs -tail <path>

Example:

hadoop fs -tail /home/myfile/test.sh

10. Display File Size

hadoop fs -du <path>

Example:

hadoop fs -du /home/myfile/test.sh

11. Count Files

hadoop fs -count <path>

Example:

hadoop fs -count /home/myfile

12. Display File System Details

hadoop fs -df <path>

Example:

hadoop fs -df /home/myfile

13. Merge Files

Copy multiple files from HDFS, merge and sort them into a single local file.

hadoop fs -getmerge <src> <localdst>

Example:

hadoop fs -getmerge /user/hduser0011/test /home/myfile/dir

14. Store Echo Output to HDFS File

echo abc
echo abc | hadoop fs -put - <path>
echo abc | hadoop fs -put - /home/myfile/test.txt

Chapter 4: MapReduce Work Mechanism

1. MapReduce Functionality

MapReduce implements two core functions:

  • Map: Applies a function to every member of a collection.
  • Reduce: Performs an operation on results from multiple processes or independent systems in parallel.

MapReduce Data Flow

2. MapReduce Architecture

Key Concepts: Job, Tasks

It follows a master-slave structure.

Component Mapping:

  • Namenode <-> Datanode
  • ResourceManager <-> NodeManager
  • JobTracker <-> TaskTracker

JobTracker is responsible for:

  • Receiving jobs from clients, decomposing jobs, and monitoring job status.
  • Distributing tasks to TaskTracker nodes for execution.
  • Monitoring the execution of tasks on TaskTracker nodes.

NodeManager: A container for executing applications.

TaskTracker: Acts as the bridge between JobTracker and tasks. It receives and executes commands from JobTracker (e.g., run task, commit task, kill task). It periodically reports the status of local tasks to the JobTracker via heartbeats.

The MapReduce architecture consists of 4 independent node types:

  1. Client
  2. JobTracker
  3. TaskTracker
  4. HDFS

MapReduce Architecture

3. Job Scheduling

  • FIFO Scheduler
  • Fair Scheduler
  • Capacity Scheduler

4. WordCount Process

Input Data -> Split -> Map -> Shuffle -> Reduce

WordCount Process

A split contains information such as <filename, start_position, length, host_list>.

  1. Input data is distributed to nodes via Split.
  2. Each Map task processes one split.
  3. Map tasks output intermediate data.
  4. During the Shuffle phase, data is exchanged between nodes.
  5. Intermediate key-value pairs with the same key are sent to the same Reduce task.
  6. Reduce tasks execute and output the final results.

Chapter 5: Hive

1. Hive Data Model

Metadata is "data about data" or "intermediate data." It describes the properties of data, such as type, structure, history, database, table, and view information. Hive metadata is frequently read, modified, and updated, making it unsuitable for storage in HDFS. Instead, it is typically stored in a relational database.

  • What you see in the Hive command line is metadata.
  • What you see on HDFS is the physical data.

2. Complex Data Types

  • ARRAY: A collection of elements of the same data type, accessible by index.
  • STRUCT: A structure that can contain elements of different data types.
  • MAP: A collection of key-value pairs.

3. Basic Hive Operations

3.1 Entering Hive

start-all.sh

Hive CLI

[zkpk@master ~]$ hive
hive>

3.2 Viewing Tables

Show Tables

Hive commands must end with a semicolon (;).

3.3 Creating a Table

Create Table

3.4 Describing a Table Structure

DESC table_name;

Describe Table

3.5 Viewing Table Data

SELECT * FROM table_name;

Select All

3.6 Importing Data from a File

Create a local file l.txt with data:

1	aaa	f
2	bbb	f
3	ccc	m
4	ddd	f
5	eee	m

Import the data:

Load Data

View the imported data:

Select After Load

3.7 Adding a Column

ALTER TABLE table_name ADD COLUMNS (new_col_name data_type);

Add Column

3.8 Renaming a Column

ALTER TABLE table_name CHANGE col_name new_col_name data_type;

Change Column Name

3.9 Changing Column Type or Position

Change Column Type/Position

3.10 Dropping a Column

ALTER TABLE table_name REPLACE COLUMNS (col1 type, col2 type, col3 type);
-- (Only include columns to keep in the COLUMNS clause)

Replace Columns

3.11 Copying a Table (with data)

CREATE TABLE new_table AS SELECT * FROM existing_table;

Copy Table Verify Copy

3.12 Copying Table Structure (without data)

CREATE TABLE new_table AS SELECT * FROM existing_table WHERE 1=0;

Copy Structure 1 Copy Structure 2

3.13 Renaming a Table

ALTER TABLE table_name RENAME TO new_table_name;

Rename Table

3.14 Truncating a Table

TRUNCATE TABLE table_name;

Truncate Table

3.15 Dropping a Table

DROP TABLE table_name;

Drop Table

3.16 Downloading Hive Table Data to Local

INSERT OVERWRITE LOCAL DIRECTORY '/home/zkpk/directory_name' SELECT * FROM table_name;

Download Data Command

Execute Download

Check the downloaded data:

Check Local 1 Check Local 2

4. Table Storage Location

Tables are stored in HDFS at: /user/hive/warehouse

Warehouse Location Warehouse Contents HDFS Paths More HDFS

5. External vs. Internal Tables

  • Managed Table (Internal Table, Temporary Table): Deleting the table removes both the metadata and the actual data files.
  • External Table: Deleting the table only removes the metadata; the actual data files remain in HDFS.

Creating an external table:

CREATE EXTERNAL TABLE table_name (...);

Create External Table External Table Example External Table HDFS 1 External Table HDFS 2

6. Partitioned Tables

Partitioned tables divide data into multiple directories based on a partition rule (e.g., column value). This improves query performance by allowing the engine to skip irrelevant partitions.

Partition Directory Structure Partition Data Partition Example

Enable dynamic partitioning:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=1000;

Dynamic Partition Config Dynamic Partition Insert 1 Dynamic Partition Insert 2

Partitioned Table Results Partition Files HDFS Partition

7. Bucketed Tables

For each table (or partition), Hive can be further organized into buckets, which is a finer-grained data division.

Creating a bucketed table:

Create Bucket Table Bucket Table Description

Inserting data into a bucketed table:

Insert into Bucket 1 Insert into Bucket 2 Insert into Bucket 3

Bucket Data 1 Bucket Data 2

HDFS Bucket Files HDFS Bucket File Contents

8. Complex Data Types (Examples)

Complex Type Create Complex Type Insert Complex Type Query

Complex Type Final

9. Other DDL Statements

9.1 Creating a Database

Create Database

9.2 Dropping a Database

Hive does not allow dropping a database that contains tables.

DROP TABLE database_name.table_name;
DROP DATABASE database_name;

To force drop a database with tables, add the CASCADE keyword:

DROP DATABASE database_name CASCADE;

10. Views

(Omitted)

11. Common HQL Commands

  1. Select all fields:

    SELECT * FROM lxl;
    
  2. Select specific fields:

    SELECT name, gender FROM lxl;
    
  3. Limit the number of rows returned:

    SELECT * FROM lxl LIMIT 3;
    
  4. Filter with WHERE:

    SELECT * FROM lxl WHERE gender='f';
    
  5. Multiple conditions with WHERE:

    SELECT * FROM lxl WHERE gender='f' AND no=1;
    

    Note: String conditions can be enclosed in either single or double quotes.

  6. Remove duplicates with DISTINCT:

    SELECT DISTINCT age FROM lxlage;
    
  7. Group data with GROUP BY:

    Often used with aggregate functions like MAX(), MIN(), COUNT(), SUM(), AVG().

    GROUP BY Example

  8. Order data with ORDER BY:

    Default is ascending order. Use DESC for descending order.

  9. Pattern matching with LIKE:

    SELECT * FROM lxl WHERE name LIKE '%a%';
    

    This finds names containing the character 'a'.

    Using IN:

    IN Example

  10. Range query with BETWEEN AND:

    SELECT * FROM lxl WHERE no BETWEEN 2 AND 4;
    
  11. Joining tables with JOIN:

    SELECT * FROM lxl JOIN lxlage ON lxl.no = lxlage.no;
    

    Inner Join

    LEFT JOIN (all rows from left table):

    SELECT * FROM lxl LEFT JOIN lxlage ON lxl.no = lxlage.no;
    

    RIGHT JOIN (all rows from right table):

    SELECT * FROM lxl RIGHT JOIN lxlage ON lxl.no = lxlage.no;
    
  12. Combining results with UNION ALL:

    The columns must match in name and type.

    UNION ALL

  13. Filtering groups with HAVING:

    Used with GROUP BY to filter groups. WHERE cannot be used with aggregate functions.

    HAVING Example 1 HAVING Example 2 Find departments with average salary over 3800:

    HAVING Average Salary

    HAVING vs WHERE More HAVING

11. Exiting Hive

quit;

Chapter 6: ZooKeeper - Distributed Coordination System

1. Introduction

ZooKeeper is a distributed coordination service designed to solve consistency problems in distributed systems.

ZooKeeper = File System + Notification Mechanism. (Similar to a resource management system)

Both ZooKeeper and Kafka need to be started on all machines (master and slaves).

Starting Hadoop cluster is not required beforehand.

Log in to master, slave01, and slave02 nodes. Navigate to the ZooKeeper installation directory and start the service:

# On master node
cd zookeeper-3.4.10/
bin/zkServer.sh start

# On slave nodes
cd zookeeper-3.4.10/
bin/zkServer.sh start

Start command: bin/zkServer.sh start

ZooKeeper can handle two types of queues:

  • Synchronization Queue: A queue is only usable when all its members are present. It waits until all members are gathered.
  • FIFO Queue: Follows the First-In-First-Out principle for enqueue and dequeue operations.

2. Persistent vs. Ephemeral Nodes

ZooKeeper has four main types of nodes:

  • PERSISTENT Node: Exists until explicitly deleted. It does not disappear when the client session that created it ends.
  • PERSISTENT_SEQUENTIAL Node: The parent node maintains a sequence counter for its immediate children. When created, ZooKeeper appends a monotonically increasing number to the node name.
  • EPHEMERAL Node: The node's lifecycle is tied to the client session. It is automatically deleted when the session ends (note: session end, not connection close). Ephemeral nodes cannot have children.
  • EPHEMERAL_SEQUENTIAL Node: Similar to an ephemeral node, but ZooKeeper automatically appends a sequence number to the node name upon creation, similar to persistent sequential nodes.

3. get Command: Retrieving Node Data and Metadata

  • cZxid: Transaction ID for node creation
  • ctime: Node creation time
  • mZxid: Transaction ID for last modification
  • mtime: Last modification time
  • pZxid: Transaction ID for children modification
  • cversion: Version number of children
  • dataVersion: Version number of the data stored in the node
  • aclVersion: Version number of the ACL
  • ephemeralOwner: Indicates if the node is ephemeral (if yes, shows session ID; otherwise 0)
  • dataLength: Length of the data
  • numChildren: Number of children

4. Access Control (ACL)

ACL stands for Access Control List.

ZooKeeper nodes have 5 operation permissions: CREATE, READ, WRITE, DELETE, ADMIN (abbreviated as crwda).

Except for DELETE, the other 4 permissions apply to operations on the node itself.

5. Four-Letter Words (Admin Commands)

5.1 stat: View Node Status

[zkpk@master zookeeper-3.4.5]$ su root
[root@master zookeeper-3.4.5]# echo stat | nc 192.168.1.100 2181

5.2 ruok: Check if ZooKeeper is Running

Returns imok if running.

[root@master zookeeper-3.4.5]# echo ruok | nc 192.168.1.100 2181
imok

5.3 dump: List Outstanding and Ephemeral Nodes

[root@master zookeeper-3.4.5]# echo dump | nc 192.168.1.100 2181
SessionTracker dump:
...
Sessions with Ephemerals (0):

5.4 conf: Display Server Configuration

[root@master zookeeper-3.4.5]# echo conf | nc 192.168.1.100 2181
clientPort=2181
dataDir=...
dataLogDir=...
tickTime=2000
...

5.5 cons: List Connection Information

[root@master zookeeper-3.4.5]# echo cons | nc 192.168.1.100 2181
/192.168.0.68:49354[0](queued=0,recved=1,sent=0)

5.6 envi: Display Environment Variables

[root@master zookeeper-3.4.5]# echo envi | nc 192.168.1.100 2181
Environment:
zookeeper.version=...
...

5.7 mntr: Display Cluster Health Metrics

[root@master zookeeper-3.4.5]# echo mntr | nc 192.168.1.100 2181
zk_version   3.4.5-...
zk_avg_latency   0
zk_max_latency   4
...

5.8 wchs: Display Watch Information Summary

[root@master zookeeper-3.4.5]# echo wchs | nc 192.168.1.100 2181
0 connections watching 0 paths
Total watches:0

5.9 wchc and wchp: Display session/path watch details (may be restricted)

[root@master zookeeper-3.4.5]# echo wchc | nc 192.168.1.100 2181
wchc is not executed because it is not in the whitelist.

6. Stopping ZooKeeper

Stop ZooKeeper on all nodes (master and slaves).

Stop ZooKeeper

Chapter 7: Kafka

1. Kafka Concepts

Kafka is a high-throughput, distributed publish-subscribe messaging system.

  • Producer: Message producer, responsible for publishing messages to Kafka.
  • Consumer: Message consumer, reads messages from Kafka.

Kafka Producers and Consumers

  • Broker: A server node in the Kafka cluster. Each Kafka node is a broker.

Kafka Brokers and Topics

  • Message: The data unit in Kafka. Has metadata and a key.
  • Partition: A physical concept for horizontal scalability. Topics are split into partitions.
  • Topic: A category or feed name to which messages are published.
  • Segment: Partitions are physically composed of multiple segments, each storing messages.

2. Kafka Features

  • Persistence: Messages are persisted to disk, enabling batch consumption.
  • Distributed System: Easily scalable.
  • Online and Offline Support: Suitable for both real-time and batch processing.
  • Compression Support: Supports snappy, gzip.

3. Common Commands

Start ZooKeeper on all nodes (master and slaves), then start Kafka on all nodes.

Start ZooKeeper

Start Kafka

Create a topic named test with 1 partition and replication factor of 1:

[zkpk@master kafka_2.11-0.10.2.1]$ bin/kafka-topics.sh --create --zookeeper master:2181 --replication-factor 1 --partitions 1 --topic test

Topic Creation

Start a console producer on master:

[zkpk@master kafka_2.11-0.10.2.1]$ bin/kafka-console-producer.sh --broker-list master:9092 --topic test

Start Producer

Start a console consumer on slave01:

[zkpk@master kafka_2.11-0.10.2.1]$ bin/kafka-console-consumer.sh --zookeeper master:2181 --topic test --from-beginning

Start Consumer

List all topics:

bin/kafka-topics.sh --list --zookeeper master:2181

List Topics

Describe topic details:

bin/kafka-topics.sh --describe --zookeeper master:2181 --topic test

Describe Topic

4. Shutdown

bin/kafka-server-stop.sh
bin/zkServer.sh stop

Chapter 8: HBase

1. Introduction

HBase is a highly reliablee, high-performance, column-oriented, scalable, real-time read/write distributed database. It is an important component of the Hadoop ecosystem.

2. Differences from Traditional RDBMS

HBase differs from traditional relational databases in several key aspects:

  1. Data Types: HBase primarily stores strings.
  2. Data Operations: HBase lacks complex table relationships and uses simple operations like insert, query, delete, and truncate.
  3. Storage Model: RDBMS is row-oriented; HBase is column-oriented.
  4. Data Indexing: HBase has a single index: the row key.
  5. Data Maintenance: Update operations in HBase do not delete the old version of data; they create a new version, retaining old ones.
  6. Scalability: HBase is designed for horizontal scalability.

3. Common Commands

3.1 Startup

Standalone Mode:

start-hbase.sh
stop-hbase.sh

Pseudo-distributed Mode:

start-all.sh
start-hbase.sh
# jps should show: master -> HMaster, slave -> HRegionServer

Fully-distributed Mode:

start-all.sh

zkServer.sh start
# (Start on all nodes, jps shows QuorumPeerMain)

zkServer.sh status
# Returns follower or leader for each node

start-hbase.sh  # (Execute on master node)

3.2 Startup Sequence 2

start-all.sh
start-hbase.sh

HBase Start 1 HBase Start 2

3.3 Entering HBase Shell

...

Tags: Big Data HDFS mapreduce Hive ZooKeeper

Posted on Wed, 01 Jul 2026 16:57:39 +0000 by james13009