Hadoop Distributed System Fundamentals and Cluster Setup

Big Data Processing Overview

Big data involves analyzing massive datasets to extract valuable insights for organizational decision-making. Core processing stages include:

Data acquisition
Data processing
Result visualization

Hadoop Framework

Hadoop provides distributed processing capabilities for large datasets across computer clusters. Its architecture addresses two fundamental challenges:

Distributed Storage (HDFS)

Files are split into fixed-size blocks (128 MB by default) that are distributed and replicated across DataNodes, so a single file can be larger than any one node's disk. A master NameNode manages file metadata and coordinates access.
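To make the block arithmetic concrete, here is a minimal sketch, assuming sizes are given in whole megabytes and the 128 MB block size mentioned above. `hdfs_block_count` is a hypothetical helper for illustration, not part of Hadoop:

```shell
#!/usr/bin/env bash
# Number of HDFS blocks needed for a file, using ceiling division.
# Both arguments are in MB; the block size defaults to 128 MB.
hdfs_block_count() {
  local file_mb=$1
  local block_mb=${2:-128}
  echo $(( (file_mb + block_mb - 1) / block_mb ))
}

hdfs_block_count 300    # 128 + 128 + 44 MB, so 3 blocks
hdfs_block_count 1024   # exactly 8 full blocks
```

The last block only occupies as much space as the data it holds, so the 300 MB file uses 300 MB per replica rather than 384 MB.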

Distributed Computation (MapReduce)

A job is divided into smaller tasks that run in parallel across cluster nodes. YARN's ResourceManager handles task scheduling and resource allocation.
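The map, shuffle, and reduce stages of a MapReduce job can be imitated on one machine with standard Unix tools. This is only an analogy, not Hadoop's API, and `word_count` is a hypothetical function name:

```shell
#!/usr/bin/env bash
# Word count as local stages that mirror a MapReduce job.
word_count() {
  tr -s '[:space:]' '\n' |    # map: emit one word per line
    grep -v '^$' |            # drop the empty line left by leading whitespace
    LC_ALL=C sort |           # shuffle: bring identical keys together
    uniq -c |                 # reduce: count each group of identical words
    awk '{print $2 "\t" $1}'  # format as word<TAB>count
}

printf 'big data big cluster\ndata big\n' | word_count
```

In a real cluster the map and reduce stages run as many parallel tasks on different nodes, and the sort step corresponds to the framework's shuffle between them.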

Core Components

HDFS: Distributed file system for high-throughput data access
MapReduce: Parallel processing framework for large datasets
YARN: Resource management and job scheduling
Hadoop Common: Foundational libraries and utilities shared by the other modules

Cluster Environment Configuration

Infrastructure Planning

IP	Hostname	Roles
192.168.174.100	node01	NameNode, ResourceManager, ZK
192.168.174.120	node02	DataNode, NodeManager, ZK
192.168.174.130	node03	DataNode, NodeManager, ZK

System Configuration

Disable firewalls:
```
systemctl stop firewalld
systemctl disable firewalld
```
Disable SELinux (the change takes effect after a reboot):
```
sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
```
Configure hostnames and network mappings
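
For the hostname mappings, one possible approach is to generate the `/etc/hosts` entries from the infrastructure plan and review them before applying. `cluster_hosts` is a hypothetical helper name; the addresses and hostnames are the ones from the table above:

```shell
#!/usr/bin/env bash
# Print the /etc/hosts entries for the three nodes in the plan.
cluster_hosts() {
  cat <<'EOF'
192.168.174.100 node01
192.168.174.120 node02
192.168.174.130 node03
EOF
}

# Review the output first. To apply it on a node, run as root:
#   cluster_hosts >> /etc/hosts
#   hostnamectl set-hostname node01   # use the matching name on each node
cluster_hosts
```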

SSH Key Authentication

Generate keys on all nodes:
```
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
```
Distribute public keys (repeat for each node that must be reachable without a password):
```
ssh-copy-id node01
```
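
Since the key has to reach every node, a small loop saves typing. The sketch below only prints the commands (a dry run), assuming the three hostnames from the plan; `ssh_copy_commands` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hostnames from the infrastructure plan.
NODES="node01 node02 node03"

# Print one ssh-copy-id command per node (dry run; nothing is executed).
ssh_copy_commands() {
  local node
  for node in $NODES; do
    echo "ssh-copy-id $node"
  done
}

ssh_copy_commands
```

Once the list looks right, `eval "$(ssh_copy_commands)"` would run the commands in the current shell; each one prompts for that node's password.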

Time Synchronization

```
yum install -y ntp
crontab -e
# Add this entry in the editor to sync the clock every minute:
*/1 * * * * /usr/sbin/ntpdate pool.ntp.org
```

Software Installation

Java Development Kit

```
tar -zxvf jdk-8u221-linux-x64.tar.gz -C /opt/
echo 'export JAVA_HOME=/opt/jdk1.8.0_221
export PATH=$JAVA_HOME/bin:$PATH' >> /etc/profile
source /etc/profile
```

ZooKeeper Cluster

Configure zoo.cfg:
```
dataDir=/data/zookeeper
server.1=node01:2888:3888
server.2=node02:2888:3888
server.3=node03:2888:3888
```
Initialize node identities:
```
# On node01:
echo 1 > /data/zookeeper/myid
# On node02:
echo 2 > /data/zookeeper/myid
# On node03:
echo 3 > /data/zookeeper/myid
```
Start the ensemble (run on every node):
```
zkServer.sh start
```
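
Each node's `myid` has to match the number in its `server.N` line. As a rough sketch, the id can be looked up from the config instead of typed by hand; `zk_myid` is a hypothetical helper, and the temporary file below only stands in for the real zoo.cfg:

```shell
#!/usr/bin/env bash
# Print the N from the "server.N=<host>:..." line that names the given host.
zk_myid() {
  local host=$1 cfg=$2
  sed -n "s/^server\.\([0-9][0-9]*\)=${host}:.*/\1/p" "$cfg"
}

# Demo against a throwaway copy of the settings shown above.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
dataDir=/data/zookeeper
server.1=node01:2888:3888
server.2=node02:2888:3888
server.3=node03:2888:3888
EOF

zk_myid node02 "$cfg"   # prints 2
rm -f "$cfg"
```

On a real node the result could be redirected to `/data/zookeeper/myid`, passing `"$(hostname)"` and the path to the actual zoo.cfg.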

Hadoop Cluster Deployment

Configuration Files

core-site.xml

```
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node01:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
  </property>
</configuration>
```

hdfs-site.xml

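```
<configuration>
  <!-- The plan above lists only two DataNodes (node02, node03). With a value of 3,
       HDFS reports blocks as under-replicated until a third DataNode joins;
       use 2 if the cluster stays at two DataNodes. -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/hadoop/namenode</value>
  </property>
</configuration>
```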

yarn-site.xml

```
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node01</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```

Cluster Initialization

Format HDFS (once, on node01, before the first start):
```
hdfs namenode -format
```
Start services:
```
start-dfs.sh
start-yarn.sh
mapred --daemon start historyserver
```

Verification

HDFS Web UI: http://node01:9870
YARN Web UI: http://node01:8088
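
Besides the web UIs, the JDK's `jps` tool lists the Java daemons running on a node. The sketch below compares such a listing with the roles from the plan; `missing_daemons` is a hypothetical helper, and the sample input with its process ids is made up for illustration:

```shell
#!/usr/bin/env bash
# Read `jps` output on stdin and print each expected daemon that is absent.
missing_daemons() {
  local listing daemon
  listing=$(cat)
  for daemon in "$@"; do
    # -w matches whole words, so NameNode does not match SecondaryNameNode
    printf '%s\n' "$listing" | grep -qw "$daemon" || echo "$daemon"
  done
}

# Illustrative sample input standing in for real `jps` output.
printf '2101 NameNode\n2390 ResourceManager\n' |
  missing_daemons NameNode ResourceManager QuorumPeerMain
```

On a live node the input would come from `jps` itself, for example `jps | missing_daemons DataNode NodeManager QuorumPeerMain` on node02. `QuorumPeerMain` is the process name under which ZooKeeper appears.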

Tags: Hadoop HDFS mapreduce YARN ZooKeeper

Posted on Tue, 26 May 2026 01:24:57 +0000 by SidewinderX
