- Cluster Planning
This guide covers the setup of a two-node Hadoop cluster for HDFS and YARN. The configuration uses one master node (master-node, 192.168.56.2) that runs the NameNode and one worker node (worker-node, 192.168.56.3) that runs the SecondaryNameNode and the YARN ResourceManager; both nodes also host DataNodes.
Note: the ResourceManager does not have to run on the HDFS master node, and multiple ResourceManager instances can be deployed if needed.
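Both machines must be able to resolve each other by hostname. A minimal /etc/hosts sketch, assuming the IP addresses used later in this guide:
192.168.56.2 master-node
192.168.56.3 worker-node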
- Common Configuration Steps
1.1 Creating the Hadoop User
Run the following as root on both nodes:
groupadd hadoop
useradd -m -G hadoop hadoop
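Set a password for the account if interactive logins are needed, then switch to it for the remaining steps (a minimal sketch; adjust to local policy):
passwd hadoop
su - hadoop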
1.2 Configuring SSH Passwordless Access
Both nodes need to communicate via SSH without passwords.
Step 1: Generate RSA key pair on both nodes (as hadoop user):
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
Step 2: On both nodes, add the public key to authorized_keys:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
Step 3: From worker-node, copy its public key to master-node:
scp ~/.ssh/id_rsa.pub hadoop@master-node:~/.ssh/id_rsa_pub_worker
Step 4: On master-node, append the worker's public key to authorized_keys:
cat ~/.ssh/id_rsa_pub_worker >> ~/.ssh/authorized_keys
Step 5: From master-node, copy the completed authorized_keys to worker-node:
scp ~/.ssh/authorized_keys hadoop@worker-node:~/.ssh/
Step 6: Set the correct permissions on both nodes:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
Test connectivity:
ssh master-node
ssh worker-node
If login works without password, configuration is successful.
Note: For additional nodes, merge all public keys into a single authorized_keys file on the master, then distribute it to all worker nodes.
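Once every node's public key has been appended to the master's authorized_keys, a small loop can push the merged file out (the extra hostname below is illustrative):
for node in worker-node worker-node-2; do
  scp ~/.ssh/authorized_keys hadoop@$node:~/.ssh/
done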
1.3 Creating Required Directories
Create the data directories as the hadoop user on both nodes:
mkdir -p /home/hadoop/hadoop_data/hdfs/name
mkdir -p /home/hadoop/hadoop_data/hdfs/data
mkdir -p /home/hadoop/hadoop_data/tmp
1.4 Installing Hadoop
Download and extract Hadoop 2.8.0 to /home/hadoop/ on both nodes:
tar -xzf hadoop-2.8.0.tar.gz -C /home/hadoop/
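If the archive is not yet on the machine, it can be downloaded from the Apache release archive first (URL for the 2.8.0 release; verify the checksum against the official distribution):
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz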
1.5 Environment Variables
Add these to ~/.bashrc or ~/.profile:
export JAVA_HOME=/usr/local/jdk1.8.0_131
export HADOOP_HOME=/home/hadoop/hadoop-2.8.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export PATH=$PATH:$HOME/.local/bin:$HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin
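Reload the shell configuration and confirm the binaries are on the PATH:
source ~/.bashrc
hadoop version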
1.6 Logging Configuration
Edit etc/hadoop/log4j.properties to add:
log4j.logger.org.apache.hadoop.util.NativeCodeLoader=DEBUG
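Raising the NativeCodeLoader logger to DEBUG shows why the native Hadoop libraries do or do not load; their status can also be checked directly with:
hadoop checknative -a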
- Node Configuration
2.1 Master Node (192.168.56.2) Configuration
core-site.xml
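As in every Hadoop configuration file, the <property> elements listed below go inside the file's <configuration> root element:
<configuration>
<!-- property elements go here -->
</configuration>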
<property>
<name>fs.defaultFS</name>
<value>hdfs://master-node:9001</value>
<description>HDFS URI - filesystem namenode:port, default is 9000</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/hadoop_data/tmp</value>
<description>Temporary directory on namenode</description>
</property>
<property>
<name>ipc.client.connect.max.retries</name>
<value>100</value>
<description>Maximum connection retries (default is 10)</description>
</property>
<property>
<name>ipc.client.connect.retry.interval</name>
<value>10000</value>
<description>Connection retry interval in milliseconds (default is 1000 ms)</description>
</property>
<!-- Proxy configuration for tools like beeline -->
<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
</property>
hdfs-site.xml
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hadoop/hadoop_data/hdfs/name</value>
<description>Directory for storing HDFS namespace metadata</description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hadoop/hadoop_data/hdfs/data</value>
<description>Physical storage location for data blocks</description>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default replication factor (default is 3)</description>
</property>
<property>
<name>dfs.namenode.rpc-address</name>
<value>master-node:9001</value>
<description>RPC address for client requests</description>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>master-node:50070</value>
<description>NameNode web UI address and port</description>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>worker-node:50090</value>
<description>SecondaryNameNode web UI address</description>
</property>
slaves
Edit the slaves file to include:
master-node
worker-node
yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
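Only the shuffle auxiliary service is configured here. Because the ResourceManager runs on worker-node in this layout, the master's NodeManager most likely also needs to be told where to find it; the following property mirrors the worker-node configuration below and is an addition, not part of the original setup:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>worker-node</value>
</property>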
2.2 Worker Node (192.168.56.3) Configuration
core-site.xml
Same as master node configuration.
hdfs-site.xml
Same as master node configuration.
slaves
Same as master node configuration.
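Because core-site.xml, hdfs-site.xml, and slaves are identical on both machines, they can simply be copied over from the master after editing, for example:
scp $HADOOP_CONF_DIR/core-site.xml $HADOOP_CONF_DIR/hdfs-site.xml $HADOOP_CONF_DIR/slaves hadoop@worker-node:$HADOOP_CONF_DIR/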
yarn-site.xml
<property>
<name>yarn.resourcemanager.address</name>
<value>worker-node:8032</value>
<description>ResourceManager address</description>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>worker-node</value>
<description>ResourceManager hostname</description>
</property>
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
<description>Scheduler class</description>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>worker-node:8099</value>
<description>ResourceManager web UI address</description>
</property>
<property>
<name>yarn.nodemanager.webapp.address</name>
<value>worker-node:8042</value>
<description>NodeManager web UI address</description>
</property>
- Initialization
Format the NameNode on the master node:
hdfs namenode -format
Note: The SecondaryNameNode is not formatted separately; only the NameNode format above is required.
- Starting the Cluster
4.1 Starting HDFS
Execute on master node:
cd $HADOOP_HOME/sbin
./start-dfs.sh
Output:
Starting namenodes on [master-node]
master-node: starting namenode, logging to /home/hadoop/hadoop-2.8.0/logs/hadoop-hadoop-namenode-master-node.out
worker-node: starting datanode, logging to /home/hadoop/hadoop-2.8.0/logs/hadoop-hadoop-datanode-worker-node.out
master-node: starting datanode, logging to /home/hadoop/hadoop-2.8.0/logs/hadoop-hadoop-datanode-master-node.out
Starting secondary namenodes [worker-node]
worker-node: starting secondarynamenode, logging to /home/hadoop/hadoop-2.8.0/logs/hadoop-hadoop-secondarynamenode-worker-node.out
Verify processes with jps command:
On master node (192.168.56.2):
5239 NameNode
5373 DataNode
On worker node (192.168.56.3):
4167 SecondaryNameNode
4069 DataNode
4.2 Starting YARN
Execute on worker node (where ResourceManager is configured):
cd $HADOOP_HOME/sbin
./start-yarn.sh
Note: YARN must be started from the node configured as ResourceManager.
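If YARN came up correctly, jps should now additionally report a ResourceManager process on the worker node and a NodeManager process on each node:
jps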
- Testing the Cluster
5.1 Creating a Directory
hadoop fs -mkdir /testdir
hadoop fs -ls /
Output:
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2017-07-21 17:33 /testdir
5.2 Uploading a File
hadoop fs -copyFromLocal -f localfile.txt hdfs://master-node:9001/testdir/
5.3 Viewing File Content
hadoop fs -tail hdfs://master-node:9001/testdir/localfile.txt
5.4 Web Interface Access
NameNode UI: http://master-node:50070
DataNode UI: http://master-node:50075, http://worker-node:50075
ResourceManager UI: http://worker-node:8099 (as configured in yarn-site.xml)
NodeManager UI: http://worker-node:8042 (as configured in yarn-site.xml)
5.5 Running a MapReduce Example (grep)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar grep hdfs://master-node:9001/testdir/input output '(test)'
Output:
17/07/24 11:44:04 INFO mapreduce.Job: Counters: 29
File System Counters
FILE: Number of bytes read=604458
FILE: Number of bytes written=1252547
HDFS: Number of bytes read=1519982
HDFS: Number of bytes written=0
Map-Reduce Framework
Combine input records=0
Reduce input records=0
Reduce output records=0
Spilled Records=0
Shuffled Maps =0
GC time elapsed (ms)=0
Total committed heap usage (bytes)=169222144
The results are written to the /user/hadoop/output/ directory on HDFS.
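The matched lines and their counts can be inspected directly (the part-file name assumes the default MapReduce output naming):
hadoop fs -cat /user/hadoop/output/part-r-00000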
- Troubleshooting
6.1 IP Address Changes
If IP addresses change or the NameNode is reformatted, DataNodes may refuse to start because the stored cluster ID no longer matches. Clear the old metadata, data, and temporary directories on both nodes:
rm -Rf /home/hadoop/hadoop_data/hdfs/name
rm -Rf /home/hadoop/hadoop_data/hdfs/data
rm -Rf /home/hadoop/hadoop_data/tmp
mkdir -p /home/hadoop/hadoop_data/hdfs/name
mkdir -p /home/hadoop/hadoop_data/hdfs/data
mkdir -p /home/hadoop/hadoop_data/tmp
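Then reformat the NameNode on the master node before restarting the cluster:
hdfs namenode -format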
6.2 NameNode vs SecondaryNameNode Configuration
The key configuration in hdfs-site.xml:
<property>
<name>dfs.namenode.http-address</name>
<value>master-node:50070</value>
<description>NameNode web UI port</description>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>worker-node:50090</value>
<description>SecondaryNameNode web UI port</description>
</property>
This is the essential configuration for understanding which node hosts the NameNode and which hosts the SecondaryNameNode in a basic Hadoop cluster.
6.3 Common Issues
- SSH connection failures - verify passwordless SSH is configured
- DataNode not starting - check directory permissions
- Web UI not accessible - verify firewall settings
- Next Steps
Further enhancements to consider:
- Adding more DataNodes to the cluster
- Setting up an HA (High Availability) Hadoop cluster
- Integrating a Presto cluster on YARN
hadoop,hdfs,yarn,cluster-setup,distributed-computing,big-data