Hadoop Distributed Cluster Setup
This guide explains how to set up a fully distributed Hadoop cluster using three or more physical or virtual machines.
Cluster Architecture
- Master Node (hadoop0): NameNode, JobTracker, SecondaryNameNode
- Worker Nodes (hadoop1, hadoop2): DataNode, TaskTracker
Virtual Machine Setup
Create three virtual machines using cloning:
# Create base VM and then clone
# Use full cloning for independent instances
# Recommended memory: 400MB per node
# Node names: CentOS, CentOS1, CentOS2
Network Configuration
Assign a static IP address to each node:
# CentOS: 192.168.80.100
# CentOS1: 192.168.80.101
# CentOS2: 192.168.80.102
# Restart network service
service network restart
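On CentOS releases that use the `service network` init script, a static address is usually defined in an interface file under /etc/sysconfig/network-scripts/. The sketch below only prints such a file. The interface name eth0, the 255.255.255.0 netmask, and the helper name render_ifcfg are assumptions for illustration, not values from this guide.

```shell
# Print a static-IP interface definition for one node.
# Assumptions: interface eth0, netmask 255.255.255.0 (adjust to your network).
render_ifcfg() {
    ip_addr="$1"
    cat <<EOF
DEVICE=eth0
BOOTPROTO=static
ONBOOT=yes
IPADDR=$ip_addr
NETMASK=255.255.255.0
EOF
}

# Example: preview the file for the first worker.
render_ifcfg 192.168.80.101
```

Once the preview looks right, the output could be redirected into the node's interface file before restarting the network service.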
Hostname Configuration
Set hostnames for worker nodes:
# Edit network configuration
vi /etc/sysconfig/network
# Reboot to apply changes
reboot
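The edit in /etc/sysconfig/network comes down to one HOSTNAME= line. As a rough sketch, the helper below (set_hostname_line is a hypothetical name) rewrites that line, or appends it if it is missing. It is demonstrated on a scratch file rather than the real configuration.

```shell
# Set the HOSTNAME= line in a sysconfig-style file, adding it if missing.
set_hostname_line() {
    file="$1"
    name="$2"
    if grep -q '^HOSTNAME=' "$file"; then
        # Rewrite through a temp file (portable; avoids sed -i differences).
        sed "s/^HOSTNAME=.*/HOSTNAME=$name/" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    else
        echo "HOSTNAME=$name" >> "$file"
    fi
}

# Demo on a scratch file; on a real node the target is /etc/sysconfig/network.
demo_file=$(mktemp)
printf 'NETWORKING=yes\nHOSTNAME=localhost.localdomain\n' > "$demo_file"
set_hostname_line "$demo_file" hadoop1
cat "$demo_file"
```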
Clean Previous Configurations
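On each cloned worker node, remove the SSH keys and local installations inherited from the base VM (do not run this on the master, whose /usr/local is copied to the workers later):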
# Clean SSH directory
rm -rf /root/.ssh/*
# Remove local installations
rm -rf /usr/local/*
# Reset environment profile
vi /etc/profile
SSH Key Setup
Configure password-less SSH access:
# Generate SSH keys
ssh-keygen -t rsa
# Authorize keys
cd /root/.ssh/
cat id_rsa.pub >> authorized_keys
# Test local connection
ssh localhost
exit
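If the key setup needs to run without prompts, something like the following may work. It is a sketch under assumptions: SSH_DIR is a variable introduced here, and it defaults to a scratch directory so the commands can be tried without touching /root/.ssh.

```shell
# Non-interactive version of the key setup.
# SSH_DIR defaults to a scratch directory; set SSH_DIR=/root/.ssh to apply it for real.
SSH_DIR="${SSH_DIR:-$(mktemp -d)}"
mkdir -p "$SSH_DIR"
chmod 700 "$SSH_DIR"   # sshd can refuse keys when this directory is writable by others

if command -v ssh-keygen >/dev/null 2>&1; then
    # -N "" sets an empty passphrase, -f names the key file; keep an existing key if present.
    [ -f "$SSH_DIR/id_rsa" ] || ssh-keygen -q -t rsa -N "" -f "$SSH_DIR/id_rsa"
    cat "$SSH_DIR/id_rsa.pub" >> "$SSH_DIR/authorized_keys"
    chmod 600 "$SSH_DIR/authorized_keys"
fi
```

The restrictive permissions matter because a login that still asks for a password after this step is often caused by an overly open .ssh directory or authorized_keys file.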
Hosts File Configuration
Add host entries to /etc/hosts on all nodes:
192.168.80.100 hadoop0
192.168.80.101 hadoop1
192.168.80.102 hadoop2
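Because the same three entries go onto every node, it can help to add them in a way that is safe to repeat. The sketch below uses a hypothetical helper, add_host_entry, and a scratch file in place of /etc/hosts.

```shell
# Append "IP hostname" to a hosts file only if the hostname is not already mapped.
add_host_entry() {
    file="$1"
    ip="$2"
    name="$3"
    if ! grep -qE "[[:space:]]$name([[:space:]]|\$)" "$file"; then
        printf '%s %s\n' "$ip" "$name" >> "$file"
    fi
}

# Demo on a scratch file; on a real node the target is /etc/hosts.
HOSTS_FILE=$(mktemp)
echo '127.0.0.1 localhost' > "$HOSTS_FILE"
for pass in 1 2; do   # the second pass adds nothing
    add_host_entry "$HOSTS_FILE" 192.168.80.100 hadoop0
    add_host_entry "$HOSTS_FILE" 192.168.80.101 hadoop1
    add_host_entry "$HOSTS_FILE" 192.168.80.102 hadoop2
done
cat "$HOSTS_FILE"
```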
Cross-Node SSH Setup
Enable mutual SSH access between all nodes:
# Copy this node's public key to the other nodes (repeat on every node)
ssh-copy-id -i /root/.ssh/id_rsa.pub hadoop1
ssh-copy-id -i /root/.ssh/id_rsa.pub hadoop2
# Verify connectivity
ssh hadoop1
ssh hadoop2
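An interactive `ssh hadoop1` will fall back to a password prompt if the key was not accepted, which can hide a failed setup. A non-interactive check makes the failure explicit. In this sketch, WORKERS, RUN, and check_workers are names introduced for illustration. RUN defaults to echo, so the commands are only printed.

```shell
# Print (or run) a non-interactive SSH check for each worker.
# RUN defaults to "echo" (dry run); set RUN= (empty) to execute the commands.
WORKERS="hadoop1 hadoop2"

check_workers() {
    for host in $WORKERS; do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        ${RUN-echo} ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" hostname
    done
}

check_workers
```

When executed for real, each command should print the remote hostname. An error instead suggests that the key has not been authorized on that node.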
Software Distribution
Copy the Hadoop and JDK installations to the worker nodes:
# Clean master node directories
rm -rf /usr/local/hadoop/logs/
rm -rf /usr/local/hadoop/tmp/
# Copy to worker nodes
scp -r /usr/local/jdk hadoop1:/usr/local/
scp -r /usr/local/jdk hadoop2:/usr/local/
scp -r /usr/local/hadoop hadoop1:/usr/local/
scp -r /usr/local/hadoop hadoop2:/usr/local/
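With more than two workers, the copy commands above are easier to manage as a loop. The following is a minimal sketch. WORKERS, RUN, and copy_to_workers are hypothetical names, and RUN defaults to echo so that nothing is copied until it is cleared.

```shell
# Copy a directory to the same destination on every worker.
# RUN defaults to "echo" (dry run); set RUN= (empty) to perform the copies.
WORKERS="hadoop1 hadoop2"

copy_to_workers() {
    src="$1"
    dest="$2"
    for host in $WORKERS; do
        ${RUN-echo} scp -r "$src" "$host:$dest"
    done
}

copy_to_workers /usr/local/jdk /usr/local/
copy_to_workers /usr/local/hadoop /usr/local/
```

Adding a node then only means extending the WORKERS list.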
Environment Configuration
Distribute environment settings:
# Copy profile to workers
scp /etc/profile hadoop1:/etc/
scp /etc/profile hadoop2:/etc/
# Apply changes (run on each worker node)
source /etc/profile
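The guide does not list the contents of /etc/profile. For the installation paths used here, the relevant lines would plausibly look like the sketch below. The variable names follow common convention for this generation of Hadoop and are an assumption rather than something taken from the original profile.

```shell
# Lines of the kind typically appended to /etc/profile for this layout.
export JAVA_HOME=/usr/local/jdk
export HADOOP_HOME=/usr/local/hadoop
# Put the java, hadoop, and start-all.sh commands on the PATH.
export PATH="$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin"
```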
Cluster Configuration
On the master node, list the worker nodes in Hadoop's slaves file:
# Edit slaves file
vi /usr/local/hadoop/conf/slaves
# Remove localhost and add:
hadoop1
hadoop2
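The same result can be produced without an editor by overwriting the file with one hostname per line. This sketch writes to a scratch file. write_slaves is a hypothetical helper name.

```shell
# Overwrite a slaves file with one worker hostname per line.
write_slaves() {
    file="$1"
    shift
    printf '%s\n' "$@" > "$file"
}

# Demo on a scratch file; the real path is /usr/local/hadoop/conf/slaves.
SLAVES_FILE=$(mktemp)
write_slaves "$SLAVES_FILE" hadoop1 hadoop2
cat "$SLAVES_FILE"
```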
Cluster Startup
Initialize and start the cluster from the master node (hadoop0):
# Format HDFS (first start only; reformatting erases existing HDFS metadata)
hadoop namenode -format
# Start all services
start-all.sh
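After startup, the JDK's jps tool lists the running Java processes, which gives a quick way to confirm that each node runs the daemons named in the architecture section. The helper below (missing_daemons, a name introduced here) reads jps-style output on standard input and prints any expected daemon that is absent. The sample listing uses made-up process IDs.

```shell
# Report which expected daemons are absent from jps-style output on stdin.
missing_daemons() {
    listing=$(cat)
    for daemon in "$@"; do
        # awk exits 0 only if some line has the daemon name in column 2.
        printf '%s\n' "$listing" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
            echo "$daemon"
    done
}

# Canned example in which the JobTracker did not start:
printf '2101 NameNode\n2245 SecondaryNameNode\n2390 Jps\n' |
    missing_daemons NameNode SecondaryNameNode JobTracker
# prints: JobTracker
```

On a live cluster the input would come from `jps` itself, checked against NameNode, SecondaryNameNode, and JobTracker on hadoop0, and against DataNode and TaskTracker on each worker.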
Web Interface Access
To reach the web interfaces by hostname from a Windows machine, add the host entries to the Windows hosts file (C:\Windows\System32\drivers\etc\hosts):
192.168.80.100 hadoop0
192.168.80.101 hadoop1
192.168.80.102 hadoop2
Access the web interfaces at http://hadoop0:50070 (NameNode) and http://hadoop0:50030 (JobTracker).
Node Management
To add a new node:
# Configure new node environment
# Add to slaves file
# Start services on new node
hadoop-daemon.sh start datanode
hadoop-daemon.sh start tasktracker
# Refresh nodes
hadoop dfsadmin -refreshNodes
To remove a node:
# Find the daemon's process ID (for example with jps), then stop it
kill -9 [process_id]
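The [process_id] placeholder can be looked up from jps output. As a small sketch, the hypothetical helper daemon_pid extracts the ID for a named daemon. The sample listing uses made-up process IDs.

```shell
# Print the PID of a named daemon from jps-style output on stdin.
daemon_pid() {
    awk -v d="$1" '$2 == d { print $1 }'
}

# Canned example:
printf '3120 DataNode\n3288 TaskTracker\n3401 Jps\n' | daemon_pid TaskTracker
# prints: 3288
```

On the node being removed this would be used as `jps | daemon_pid DataNode`. A gentler alternative to kill -9 is the stop counterpart of the start commands used when adding a node: `hadoop-daemon.sh stop datanode` and `hadoop-daemon.sh stop tasktracker`.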