Hadoop Distributed System Fundamentals and Cluster Setup

Big Data Processing Overview

Big data involves analyzing massive datasets to extract valuable insights for organizational decision-making. Core processing stages include:

  1. Data acquisition
  2. Data processing
  3. Result visualization

Hadoop Framework

Hadoop provides distributed processing capabilities for large datasets across computer clusters. Its architecture addresses two fundamental challenges:

Distributed Storage (HDFS)

HDFS splits each file into fixed-size blocks (128 MB by default) and stores replicated copies of those blocks on different DataNodes, so a file can be larger than any single node's disk. A master NameNode manages file metadata and coordinates client access.
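
As a rough illustration of the block arithmetic, the sketch below computes how many blocks a file would occupy. The helper name block_count is hypothetical, and the 128 MiB default is an assumption that matches the typical block size; a real cluster may be configured differently.

```shell
# Hypothetical helper: number of HDFS blocks needed for a file.
# $1 = file size in bytes, $2 = block size in bytes (assumed 128 MiB if omitted).
block_count() (
  size=$1
  block=${2:-134217728}
  # Integer ceiling division: a partial final block still counts as one block.
  echo $(( (size + block - 1) / block ))
)

block_count 1073741824   # 1 GiB file   -> prints 8
block_count 314572800    # 300 MiB file -> prints 3
```

The final block of a file only occupies as much disk space as the data it actually holds, so the 300 MiB file uses two full blocks and one partial block.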

Distributed Computation (MapReduce)

Computational tasks are divided into smaller units executed in parallel across cluster nodes. The YARN resource manager handles task scheduling and resource allocation.
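
The map, shuffle, and reduce phases can be imitated on a single machine with standard Unix tools. This is only a local analogy for the programming model, not Hadoop code; the function name word_count is hypothetical.

```shell
# Local analogy of a MapReduce word count, built from standard Unix tools.
word_count() {
  tr -s '[:space:]' '\n' |       # map: emit one word (key) per line
    sed '/^$/d' |                # drop empty lines left by leading whitespace
    sort |                       # shuffle: bring identical keys together
    uniq -c |                    # reduce: count the records in each group
    awk '{ print $2 "\t" $1 }'   # format the result as word<TAB>count
}

# Example: printf 'b a\na b a\n' | word_count
```

In a real cluster the map and reduce steps run as separate tasks on different nodes, and the framework performs the sort and grouping between them.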

Core Components

  • HDFS: Distributed file system for high-throughput data access
  • MapReduce: Parallel processing framework for large datasets
  • YARN: Resource management and job scheduling
  • Common Utilities: Foundational libraries and tools

Cluster Environment Configuration

Infrastructure Planning

IP               Hostname  Roles
192.168.174.100  node01    NameNode, ResourceManager, ZooKeeper
192.168.174.120  node02    DataNode, NodeManager, ZooKeeper
192.168.174.130  node03    DataNode, NodeManager, ZooKeeper

System Configuration

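  1. Disable the firewall (stops it now and keeps it off after a reboot):
     ```
     systemctl stop firewalld
     systemctl disable firewalld
     ```
  2. Disable SELinux (the change applies after a reboot; run setenforce 0 to switch to permissive mode for the current session):
     ```
     sed -i 's/SELINUX=enforcing/SELINUX=disabled/g' /etc/selinux/config
     ```
  3. Set each node's hostname and map all hostnames to their IP addresses in /etc/hosts on every node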
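
The hostname mapping step can be sketched as follows. The helper names cluster_hosts and missing_host_entries are hypothetical; the addresses come from the planning table, and the sketch assumes a systemd-based distribution where hostnamectl is available.

```shell
# Cluster nodes from the planning table, one "IP hostname" pair per line.
cluster_hosts() {
  cat <<'EOF'
192.168.174.100 node01
192.168.174.120 node02
192.168.174.130 node03
EOF
}

# Print the entries whose hostname does not yet appear in the hosts file.
# $1 = hosts file to check (defaults to /etc/hosts).
missing_host_entries() (
  hosts_file="${1:-/etc/hosts}"
  cluster_hosts | while read -r ip name; do
    grep -qw -- "$name" "$hosts_file" 2>/dev/null || printf '%s %s\n' "$ip" "$name"
  done
)

# Usage on each node (as root), with the hostname adjusted per node:
#   hostnamectl set-hostname node01
#   missing_host_entries /etc/hosts | tee -a /etc/hosts
```

Printing only the missing entries keeps the step safe to repeat, since a second run appends nothing.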

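SSH Key Authentication

  1. Generate keys on all nodes:
     ```
     ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
     ```
  2. Distribute public keys (repeat for each target host; node01 in particular needs passwordless access to every node, because the start scripts launch remote daemons over SSH):
     ```
     ssh-copy-id node01
     ```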

Time Synchronization

yum install -y ntp
# Open the crontab editor, then add the line below to sync the clock every minute
crontab -e
*/1 * * * * /usr/sbin/ntpdate pool.ntp.org
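
Because crontab -e is interactive, a scripted setup needs another way to add the entry. One possible approach, sketched here with a hypothetical helper named merge_cron, is to read the current crontab, append the line only if it is missing, and load the result back.

```shell
# The schedule entry from the step above.
cron_line='*/1 * * * * /usr/sbin/ntpdate pool.ntp.org'

# Hypothetical helper: read an existing crontab on stdin and print it with
# the entry in $1 appended, unless that exact line is already present.
merge_cron() (
  line=$1
  existing="$(cat)"
  if printf '%s\n' "$existing" | grep -Fxq -- "$line"; then
    printf '%s\n' "$existing"
  else
    [ -n "$existing" ] && printf '%s\n' "$existing"
    printf '%s\n' "$line"
  fi
)

# Usage on each node:
#   crontab -l 2>/dev/null | merge_cron "$cron_line" | crontab -
```

The exact-line check makes the command safe to run repeatedly without creating duplicate entries.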

Software Installation

Java Development Kit

tar -zxvf jdk-8u221-linux-x64.tar.gz -C /opt/
echo 'export JAVA_HOME=/opt/jdk1.8.0_221
export PATH=$JAVA_HOME/bin:$PATH' >> /etc/profile
source /etc/profile

ZooKeeper Cluster

  1. Configure zoo.cfg on every node:
     ```
     dataDir=/data/zookeeper
     server.1=node01:2888:3888
     server.2=node02:2888:3888
     server.3=node03:2888:3888
     ```
  2. Initialize node identities. Each node's myid must match the number of its server.N entry in zoo.cfg:
     ```
     # On node01:
     echo 1 > /data/zookeeper/myid

     # On node02:
     echo 2 > /data/zookeeper/myid

     # On node03:
     echo 3 > /data/zookeeper/myid
     ```
  3. Start the ensemble (run on every node):
     ```
     zkServer.sh start
     ```
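
Since each myid has to agree with the matching server.N line, the value can be derived from zoo.cfg instead of typed by hand. This is a minimal sketch with a hypothetical helper, myid_for_host; it assumes short hostnames without dots, as used in this cluster.

```shell
# Hypothetical helper: print the server number that zoo.cfg assigns to a host.
# $1 = hostname, $2 = path to zoo.cfg
myid_for_host() (
  host=$1
  cfg=$2
  # A line such as "server.2=node02:2888:3888" splits on '.', '=' and ':'
  # into: server, 2, node02, 2888, 3888.
  awk -F'[.=:]' -v h="$host" '$1 == "server" && $3 == h { print $2 }' "$cfg"
)

# Usage on each node (the ZOOKEEPER_HOME location is an assumption):
#   myid_for_host "$(hostname -s)" "$ZOOKEEPER_HOME/conf/zoo.cfg" > /data/zookeeper/myid
```

Running the same command on all three nodes then yields 1, 2, and 3 respectively, with no per-node edits.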

Hadoop Cluster Deployment

Configuration Files

core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node01:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
  </property>
</configuration>

hdfs-site.xml

<configuration>
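  <!-- This plan has two DataNodes (node02, node03); with a factor of 3, HDFS
       reports blocks as under-replicated. Use 2 to match the cluster as planned. -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>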
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/hadoop/namenode</value>
  </property>
</configuration>

yarn-site.xml

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node01</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
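
The start scripts read a list of worker hostnames to decide where to launch DataNodes and NodeManagers. A minimal sketch of creating that list, assuming Hadoop 3.x (where the file is named workers; Hadoop 2.x calls it slaves) and a configuration directory of $HADOOP_HOME/etc/hadoop. The helper name write_workers is hypothetical.

```shell
# Hypothetical helper: write worker hostnames, one per line, to the workers file.
# $1 = Hadoop configuration directory, remaining arguments = worker hostnames.
write_workers() (
  conf_dir=$1
  shift
  printf '%s\n' "$@" > "$conf_dir/workers"
)

# Usage on node01, with the DataNode/NodeManager hosts from the planning table:
#   write_workers "$HADOOP_HOME/etc/hadoop" node02 node03
```

The same configuration directory, including this file, is normally copied to every node so that all daemons see identical settings.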

Cluster Initialization

  1. Format HDFS (run once, on node01 only; reformatting erases existing file system metadata):
     ```
     hdfs namenode -format
     ```
  2. Start services:
     ```
     start-dfs.sh
     start-yarn.sh
     mapred --daemon start historyserver
     ```

Verification
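
One way to check the result is to compare the Java processes reported by jps with the roles in the planning table. The sketch below uses a hypothetical helper, missing_daemons; the process names are those that jps prints for these daemons (ZooKeeper appears as QuorumPeerMain).

```shell
# Hypothetical helper: read jps output on stdin ("<pid> <name>" per line) and
# print every expected daemon name (given as arguments) that is not running.
missing_daemons() (
  jps_output="$(cat)"
  for name in "$@"; do
    printf '%s\n' "$jps_output" |
      awk -v n="$name" '$2 == n { found = 1 } END { exit !found }' ||
      printf '%s\n' "$name"
  done
)

# Usage on node01:
#   jps | missing_daemons NameNode ResourceManager QuorumPeerMain
# Usage on node02 and node03:
#   jps | missing_daemons DataNode NodeManager QuorumPeerMain
# Cluster-wide views:
#   hdfs dfsadmin -report    # live DataNodes and capacity
#   yarn node -list          # registered NodeManagers
```

Empty output means every expected daemon was found on that node.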

Tags: Hadoop HDFS mapreduce YARN ZooKeeper

Posted on Tue, 26 May 2026 01:24:57 +0000 by SidewinderX