This guide walks through the steps to set up a Hadoop 2.10 pseudo-distributed cluster on a single CentOS 7 virtual machine.
1. Create a Hadoop User and Group
We will create a dedicated user hdfs and configure it with appropriate permissions.
As root user:
Create the hdfs user and set a password:
adduser hdfs
passwd hdfs
Add the user to the hdfs group (on CentOS, useradd normally creates this group automatically; run groupadd only if it doesn't already exist):
groupadd hdfs
usermod -a -G hdfs hdfs
Verify the user and group:
cat /etc/group
groups hdfs
You should see something like: hdfs:x:1001:hdfs
Grant sudo privileges (optional but recommended):
Edit the sudoers file:
visudo # preferred, since it checks the syntax before saving (or edit /etc/sudoers directly with vim)
Add the following line below root ALL=(ALL) ALL:
hdfs ALL=(ALL) ALL
Create a software installation directory:
mkdir /opt/soft
chown -R hdfs:hdfs /opt/soft
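To confirm the ownership change took effect, list the directory; the owner and group columns should both read hdfs (the other fields will differ on your machine):
ls -ld /opt/soft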
2. Install JDK and Configure Environment Variables
Download a JDK (e.g., jdk-8u231-linux-x64.tar.gz) and extract it:
tar -zxvf jdk-8u231-linux-x64.tar.gz -C /opt/soft/
Create a symbolic link for easier version management:
cd /opt/soft
ln -s jdk1.8.0_231 jdk
Set environment variables in /etc/profile:
vim /etc/profile
Add the following lines:
# jdk
export JAVA_HOME=/opt/soft/jdk
export PATH=$PATH:$JAVA_HOME/bin
Reload the profile:
source /etc/profile
Verify installation:
java -version
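As an extra sanity check (assuming the profile has been reloaded in your current shell), verify that the variable and the binary resolve to the new JDK:
echo $JAVA_HOME
which java
The first command should print /opt/soft/jdk. If the second points somewhere else (for example, a pre-installed OpenJDK under /usr/bin), that Java comes earlier on the PATH and will be used instead.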
3. Install Hadoop 2.10.0
Download hadoop-2.10.0.tar.gz and extract it:
tar -zxvf hadoop-2.10.0.tar.gz -C /opt/soft/
Create a symbolic link:
cd /opt/soft
ln -s hadoop-2.10.0 hadoop
Add Hadoop environment variables to /etc/profile:
# hadoop
export HADOOP_HOME=/opt/soft/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Reload and verify:
source /etc/profile
hadoop version
Configure Pseudo-Distributed Mode
Edit the following Hadoop configuration files in /opt/soft/hadoop/etc/hadoop/.
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost/</value>
</property>
</configuration>
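Optionally, you can also set hadoop.tmp.dir inside the same <configuration> element and point it at a persistent location. By default it lives under /tmp, which may be cleared on reboot (this is the cause of the NameNode error covered in the troubleshooting section below). The path /opt/soft/hadoop-data is just an example; any directory writable by the hdfs user works:
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/soft/hadoop-data</value>
</property>
Create the directory and hand it to the hdfs user before formatting the NameNode:
mkdir -p /opt/soft/hadoop-data
chown hdfs:hdfs /opt/soft/hadoop-data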
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
mapred-site.xml (if only mapred-site.xml.template exists, copy it first: cp mapred-site.xml.template mapred-site.xml)
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
4. Configure SSH Passwordless Login
Hadoop uses SSH to manage daemons. Set up passwordless SSH for the hdfs user.
Check if SSH packages are installed:
yum list installed | grep ssh
Ensure the SSH daemon is running:
ps -Af | grep sshd
Generate an SSH key pair (as the hdfs user):
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
Append the public key to the authorized keys file:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Set appropriate permissions:
chmod 644 ~/.ssh/authorized_keys
Test the setup:
ssh localhost
You should not be prompted for a password.
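If ssh localhost still asks for a password, overly open permissions are a common cause; sshd refuses keys when the .ssh directory or the private key is accessible to other users. Tightening them usually resolves it:
chmod 700 ~/.ssh
chmod 600 ~/.ssh/id_rsa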
5. Format and Start Hadoop
Switch to the hdfs user:
su - hdfs
Format the NameNode:
hadoop namenode -format
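In Hadoop 2.x this form prints a deprecation warning; the equivalent current command is:
hdfs namenode -format
Either one works for this guide.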
First Start Attempt and Troubleshooting
Start all Hadoop daemons:
start-all.sh
If the daemons fail to start with an error such as "Error: JAVA_HOME is not set and could not be found.", edit $HADOOP_HOME/etc/hadoop/hadoop-env.sh and set JAVA_HOME explicitly (daemons launched over SSH do not pick it up from /etc/profile):
export JAVA_HOME=/opt/soft/jdk
Then try again:
stop-all.sh
start-all.sh
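start-all.sh and stop-all.sh are likewise deprecated in Hadoop 2.x; if you prefer, start HDFS and YARN separately with the equivalent scripts:
start-dfs.sh
start-yarn.sh
(and stop-dfs.sh / stop-yarn.sh to shut them down).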
Check if all processes are running:
jps
If the NameNode is missing, check the logs:
tail -200f $HADOOP_HOME/logs/hadoop-hdfs-namenode-*.log
A common error is:
Directory /tmp/hadoop-hdfs/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
This happens because, with no hadoop.tmp.dir configured, the NameNode keeps its metadata under /tmp (here /tmp/hadoop-hdfs/dfs/name), which can be cleared on reboot. The quick fix is to re-format the NameNode (note that this wipes any data already in HDFS); pointing hadoop.tmp.dir at a persistent directory, as mentioned in the core-site.xml section, prevents the problem from recurring.
hadoop namenode -format
stop-all.sh
start-all.sh
Verify all processes are running:
jps
Expected processes:
- NameNode
- DataNode
- SecondaryNameNode
- ResourceManager
- NodeManager
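A healthy run of jps looks roughly like the following; the process IDs will differ, and the Jps line is just the tool itself:
12345 NameNode
12456 DataNode
12567 SecondaryNameNode
12678 ResourceManager
12789 NodeManager
12890 Jps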
6. Verify the Setup
Open a web browser and go to:
http://<your-server-ip>:50070
For example, if your VM IP is 192.168.30.141, visit http://192.168.30.141:50070.
You should see the Hadoop NameNode Web UI.
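Two further quick checks, assuming the defaults used in this guide: the ResourceManager web UI is served on port 8088 (e.g., http://192.168.30.141:8088), and the HDFS report should list one live DataNode:
hdfs dfsadmin -report
You can also run a small example job to exercise MapReduce end to end (the jar name below matches the 2.10.0 release layout):
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar pi 2 5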
Firewall Configuration (if needed)
If you cannot access the web UI, check if the firewall is running:
firewall-cmd --state
To stop the firewall temporarily (as root):
systemctl stop firewalld.service
To disable it permanently:
systemctl disable firewalld.service
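Disabling the firewall is the quickest route on a throwaway VM, but if you prefer to keep firewalld running you can open just the ports used above instead (50070 for the NameNode UI, 8088 for the ResourceManager UI):
firewall-cmd --permanent --add-port=50070/tcp
firewall-cmd --permanent --add-port=8088/tcp
firewall-cmd --reload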