Setting Up a Two-Node Hadoop HDFS Cluster

  0. Cluster Planning

This guide covers the setup of a two-node Hadoop cluster running HDFS and YARN. One node (master-node, 192.168.56.2) hosts the NameNode, while the other (worker-node, 192.168.56.3) hosts the SecondaryNameNode and the YARN ResourceManager; both nodes run DataNodes.

Note: splitting the NameNode and ResourceManager across nodes is not an architectural problem; multiple ResourceManager instances can even be deployed if needed.


  1. Common Configuration Steps

1.1 Creating the Hadoop User

Run as root on both nodes; -m creates the /home/hadoop directory used throughout this guide:

groupadd hadoop
useradd -m -g hadoop hadoop
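
Set a password and switch to the new account (the remaining steps in this section assume you are working as the hadoop user):

passwd hadoop
su - hadoop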

1.2 Configuring SSH Passwordless Access

Both nodes need to communicate via SSH without passwords.

Step 1: Generate RSA key pair on both nodes (as hadoop user):

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

Step 2: Add your own public key to authorized_keys (on both nodes):

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Step 3: Copy the worker-node's public key to master-node (run on worker-node):

scp ~/.ssh/id_rsa.pub hadoop@master-node:~/.ssh/id_rsa_pub_worker

Step 4: Append the worker's public key to master-node's authorized_keys (run on master-node):

cat ~/.ssh/id_rsa_pub_worker >> ~/.ssh/authorized_keys

Step 5: Copy master-node's combined authorized_keys back to worker-node (run on master-node):

scp ~/.ssh/authorized_keys hadoop@worker-node:~/.ssh/

Set the correct permissions on both nodes (sshd ignores authorized_keys files with overly permissive modes):

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys

Test connectivity:

ssh master-node
ssh worker-node

If you can log in without a password prompt, the configuration is correct.

Note: For additional nodes, merge all public keys into a single authorized_keys file on the master, then distribute it to all worker nodes.
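
Where the ssh-copy-id utility is available, steps 3 through 5 can be replaced with a single command per direction, run on each node:

ssh-copy-id hadoop@master-node
ssh-copy-id hadoop@worker-node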

1.3 Creating Required Directories

Create the data directories on both nodes as the hadoop user:

mkdir -p /home/hadoop/hadoop_data/hdfs/name
mkdir -p /home/hadoop/hadoop_data/hdfs/data
mkdir -p /home/hadoop/hadoop_data/tmp

1.4 Installing Hadoop

Download and extract Hadoop 2.8.0 to /home/hadoop/:

tar -xzf hadoop-2.8.0.tar.gz -C /home/hadoop/
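
If the tarball is not already on the machines, it can be fetched from the Apache archive (assuming the archive still hosts the 2.8.0 release):

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz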

1.5 Environment Variables

Add these to ~/.bashrc or ~/.profile:

export JAVA_HOME=/usr/local/jdk1.8.0_131
export HADOOP_HOME=/home/hadoop/hadoop-2.8.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export PATH=$PATH:$HOME/.local/bin:$HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin
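
Reload the shell configuration, and set JAVA_HOME in Hadoop's own environment file as well, since daemons started over SSH may not source your login files:

source ~/.bashrc
# In $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
export JAVA_HOME=/usr/local/jdk1.8.0_131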

1.6 Logging Configuration

Edit etc/hadoop/log4j.properties and add the line below to get DEBUG output from the native-library loader (helpful when diagnosing the common "Unable to load native-hadoop library" warning):

log4j.logger.org.apache.hadoop.util.NativeCodeLoader=DEBUG

  2. Node Configuration

2.1 Master Node (192.168.56.2) Configuration

core-site.xml
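
All XML snippets in this section belong inside the <configuration> element of the named file:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- property elements from this guide go here -->
</configuration>

The master's core-site.xml properties: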

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master-node:9001</value>
  <description>HDFS URI (hdfs://namenode-host:port); this cluster uses port 9001</description>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/hadoop_data/tmp</value>
  <description>Temporary directory on namenode</description>
</property>
<property>
  <name>ipc.client.connect.max.retries</name>
  <value>100</value>
  <description>Maximum connection retries (default is 10)</description>
</property>
<property>
  <name>ipc.client.connect.retry.interval</name>
  <value>10000</value>
  <description>Connection retry interval in milliseconds (default is 1000)</description>
</property>
<!-- Proxy configuration for tools like beeline -->
<property>
  <name>hadoop.proxyuser.hadoop.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hadoop.groups</name>
  <value>*</value>
</property>

hdfs-site.xml

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/hadoop/hadoop_data/hdfs/name</value>
  <description>Directory for storing HDFS namespace metadata</description>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/hadoop/hadoop_data/hdfs/data</value>
  <description>Physical storage location for data blocks</description>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default replication factor (default is 3)</description>
</property>
<property>
  <name>dfs.namenode.rpc-address</name>
  <value>master-node:9001</value>
  <description>RPC address for client requests</description>
</property>
<property>
  <name>dfs.namenode.http-address</name>
  <value>master-node:50070</value>
  <description>NameNode web UI address and port</description>
</property>
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>worker-node:50090</value>
  <description>SecondaryNameNode web UI address</description>
</property>

slaves

Edit the slaves file (the list of hosts on which start-dfs.sh launches DataNodes) to include both nodes:

master-node
worker-node

yarn-site.xml

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
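
One addition worth making here: the ResourceManager runs on the worker node (see 2.2), and without an explicit address the master's NodeManager falls back to the default of 0.0.0.0:8032 and never registers. Pointing yarn.resourcemanager.hostname at the worker avoids this:

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>worker-node</value>
  <description>Where the master's NodeManager finds the ResourceManager</description>
</property>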

2.2 Worker Node (192.168.56.3) Configuration

core-site.xml

Same as master node configuration.

hdfs-site.xml

Same as master node configuration.

slaves

Same as master node configuration.

yarn-site.xml

<property>
  <name>yarn.resourcemanager.address</name>
  <value>worker-node:8032</value>
  <description>ResourceManager address</description>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>worker-node</value>
  <description>ResourceManager hostname</description>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
  <description>Scheduler class</description>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>worker-node:8099</value>
  <description>ResourceManager web UI address (the default port is 8088)</description>
</property>
<property>
  <name>yarn.nodemanager.webapp.address</name>
  <value>worker-node:8042</value>
  <description>NodeManager web UI address</description>
</property>

  3. Initialization

Format the NameNode on the master node:

hdfs namenode -format

Note: only the NameNode is formatted; there is no format command for the SecondaryNameNode. A successful format logs a line like "Storage directory /home/hadoop/hadoop_data/hdfs/name has been successfully formatted."


  4. Starting the Cluster

4.1 Starting HDFS

Execute on master node:

cd $HADOOP_HOME/sbin
./start-dfs.sh

Output:

Starting namenodes on [master-node]
master-node: starting namenode, logging to /home/hadoop/hadoop-2.8.0/logs/hadoop-hadoop-namenode-master-node.out
worker-node: starting datanode, logging to /home/hadoop/hadoop-2.8.0/logs/hadoop-hadoop-datanode-worker-node.out
master-node: starting datanode, logging to /home/hadoop/hadoop-2.8.0/logs/hadoop-hadoop-datanode-master-node.out
Starting secondary namenodes [worker-node]
worker-node: starting secondarynamenode, logging to /home/hadoop/hadoop-2.8.0/logs/hadoop-hadoop-secondarynamenode-worker-node.out

Verify processes with jps command:

On master node (192.168.56.2):

5239 NameNode
5373 DataNode

On worker node (192.168.56.3):

4167 SecondaryNameNode
4069 DataNode

4.2 Starting YARN

Execute on worker node (where ResourceManager is configured):

cd $HADOOP_HOME/sbin
./start-yarn.sh

Note: start-yarn.sh launches the ResourceManager on the machine it is invoked from, so it must be run on the node configured as the ResourceManager. Afterwards, jps on the worker should additionally show ResourceManager and NodeManager processes, and the master should show a NodeManager.


  5. Testing the Cluster

5.1 Creating a Directory

hadoop fs -mkdir /testdir
hadoop fs -ls /

Output:

Found 1 items
drwxr-xr-x - hadoop supergroup 0 2017-07-21 17:33 /testdir

5.2 Uploading a File

hadoop fs -copyFromLocal -f localfile.txt hdfs://master-node:9001/testdir/

5.3 Viewing File Content

hadoop fs -tail hdfs://master-node:9001/testdir/localfile.txt

5.4 Web Interface Access

NameNode UI: http://master-node:50070

DataNode UI: http://master-node:50075, http://worker-node:50075

5.5 Running a MapReduce Example

The examples jar ships several demo jobs; the command below runs the grep example, which counts occurrences of a regular expression in the input files:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar grep hdfs://master-node:9001/testdir/input output '(test)'

Output (excerpt):

17/07/24 11:44:04 INFO mapreduce.Job: Counters: 29
    File System Counters
        FILE: Number of bytes read=604458
        FILE: Number of bytes written=1252547
        HDFS: Number of bytes read=1519982
        HDFS: Number of bytes written=0
    Map-Reduce Framework
        Combine input records=0
        Reduce input records=0
        Reduce output records=0
        Spilled Records=0
        Shuffled Maps =0
        GC time elapsed (ms)=0
        Total committed heap usage (bytes)=169222144

Results are stored in the /user/hadoop/output/ directory (the relative path output resolves against the user's HDFS home directory).
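
To inspect the result (assuming the default part-file naming; the exact name may differ):

hadoop fs -cat hdfs://master-node:9001/user/hadoop/output/part-r-00000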


  6. Troubleshooting

6.1 IP Address Changes

If the node IP addresses change, the cluster may fail to start because the metadata stored on disk no longer matches. Clear the old metadata and data directories on both nodes, recreate them, and then reformat the NameNode:

rm -Rf /home/hadoop/hadoop_data/hdfs/name
rm -Rf /home/hadoop/hadoop_data/hdfs/data
rm -Rf /home/hadoop/hadoop_data/tmp
mkdir -p /home/hadoop/hadoop_data/hdfs/name
mkdir -p /home/hadoop/hadoop_data/hdfs/data
mkdir -p /home/hadoop/hadoop_data/tmp
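
Then reformat on the master node, as in section 3:

hdfs namenode -format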

6.2 NameNode vs SecondaryNameNode Configuration

The key configuration in hdfs-site.xml:

<property>
  <name>dfs.namenode.http-address</name>
  <value>master-node:50070</value>
  <description>NameNode web UI port</description>
</property>
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>worker-node:50090</value>
  <description>SecondaryNameNode web UI port</description>
</property>

These two properties decide where the HDFS web UIs live and, in particular, which host start-dfs.sh launches the SecondaryNameNode on; getting them right is essential for placing the daemons on the intended nodes.

6.3 Common Issues

  • SSH connection failures - verify that passwordless SSH works in both directions (section 1.2)
  • DataNode not starting - check the data directory permissions and the DataNode log (see the log check below)
  • Web UI not accessible - verify firewall settings for the ports listed above
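
For the DataNode case, the daemon log is usually the fastest diagnostic; HDFS log file names follow the hadoop-<user>-<daemon>-<host>.log pattern:

tail -n 50 $HADOOP_HOME/logs/hadoop-hadoop-datanode-*.log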

  7. Next Steps

Further enhancements to consider:

  1. Adding more DataNodes to the cluster
  2. Setting up HA (High Availability) Hadoop cluster
  3. Integrating a Presto cluster on YARN


Tags: Hadoop HDFS YARN cluster-setup distributed-computing

Posted on Thu, 14 May 2026 04:41:39 +0000 by Brentley_11