- Local Mode (Standalone): In this configuration, Hadoop functions purely as a library. It executes MapReduce jobs on a single machine without managing background processes. This mode is intended solely for debugging code and rapid prototyping.
- Pseudo-Distributed Mode: Here, all Hadoop daemons run as separate background processes on a single host. This setup simulates a cluster environment on one machine, making it ideal for educational purposes and integration testing.
- Distributed Mode: This is the production-grade architecture, in which the Hadoop daemons are spread across multiple nodes that communicate over a network to provide scalable, fault-tolerant storage and data processing.
HDFS Architecture and Lifecycle Management
Similar to database systems, HDFS operates through long-running daemon processes. Interaction occurs via a client application connecting to the server over a socket-based network interface.
Assuming a base environment is already prepared within a Docker container, follow these steps to establish a functional Hadoop unit. If the previous container session was terminated, restart the base instance labeled base_hadoop before proceeding.
Account Provisioning and Privilege Escalation
Create a dedicated system account to manage Hadoop processes securely rather than running them as root:
useradd -m hadoop_user
Install necessary utilities for credential management and privilege handling:
yum install -y sudo passwd
Assign a secure password to the newly created account:
passwd hadoop_user
Ensure the software installation directory belongs to this new user:
chown -R hadoop_user /opt/hadoop
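Update the /etc/sudoers file (preferably through visudo, which checks the syntax before saving) to grant the user sudo access where required. Append the following entry below the root definition. With this entry the user is still prompted for the account password; a NOPASSWD: tag would be needed for passwordless access: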
hadoop_user ALL=(ALL) ALL
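When this step is scripted, appending the rule on every run leaves duplicate entries in the file. A minimal sketch of an idempotent append follows. The helper name ensure_line is a hypothetical name chosen for illustration, not part of sudo:

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already present, so repeated runs do not create duplicates.
ensure_line() {
    line=$1
    file=$2
    # -x matches whole lines only, -F treats the text literally (no regex).
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

Run as root, ensure_line 'hadoop_user ALL=(ALL) ALL' /etc/sudoers adds the rule once, and visudo -c can then be used to check the syntax of the result.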
Once configured, persist the current state of the base container into a reusable image:
docker commit base_hadoop image_hadoop_deploy
Instance Creation and SSH Trust
Launch a fresh container from the committed image specifically for HDFS operations:
docker run -d --name hdfs_instance -t image_hadoop_deploy /sbin/init
Attach to the container and switch to the restricted user:
docker exec -it hdfs_instance su hadoop_user
whoami
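The HDFS start scripts log in to each node over SSH to launch the daemons, so the account needs key-based login to its own host. Generate an SSH key pair (leave the passphrase empty if logins should not prompt):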
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa
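Install the public key in the account's authorized_keys file. First, retrieve the container's internal IP address dynamically. Inside a container on Docker's default bridge network the interface is normally eth0 (docker0 exists only on the Docker host):
DOCKER_IP=$(ip addr show eth0 | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1 | head -n1)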
echo $DOCKER_IP
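A plain grep for inet also matches inet6 lines, so the address extraction is easy to get subtly wrong. As a sketch, the parsing can be isolated in a small filter that compares the first field exactly. The name first_ipv4 is hypothetical:

```shell
# Hypothetical helper: read `ip addr show <iface>` output on stdin and
# print the first IPv4 address without its prefix length.
first_ipv4() {
    # Field 1 is exactly "inet" on IPv4 lines; IPv6 lines say "inet6".
    awk '$1 == "inet" { split($2, parts, "/"); print parts[1]; exit }'
}
```

Assuming the container's interface is eth0, the variable can then be set with DOCKER_IP=$(ip addr show eth0 | first_ipv4).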
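Append the generated public key to the account's authorized_keys file at the detected IP. This requires an SSH server running in the container and prompts for the account password set earlier: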
ssh-copy-id hadoop_user@$DOCKER_IP
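To confirm that the key was installed, the public key can be looked up in authorized_keys. The sketch below assumes the key was appended verbatim as its own line, which is what ssh-copy-id does. The helper name key_is_authorized is hypothetical:

```shell
# Hypothetical helper: succeed if the public key stored in file $1
# appears as a complete line in the authorized_keys file $2.
key_is_authorized() {
    pub_file=$1
    auth_file=$2
    [ -r "$pub_file" ] && [ -r "$auth_file" ] || return 1
    # -f reads the pattern from the key file; -x and -F require an
    # exact, literal whole-line match.
    grep -qxF -f "$pub_file" -- "$auth_file"
}
```

For example, key_is_authorized ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys succeeds once the key is in place. An end-to-end check is ssh -o BatchMode=yes hadoop_user@$DOCKER_IP true, which fails instead of prompting if key-based login does not work.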
Core Configuration and Service Launch
Navigate to the configuration directory relative to the installation path:
cd $HADOOP_HOME/etc/hadoop
Modify core-site.xml to define the default filesystem namespace:
sed -i '/<\/configuration>/i\
<property>\
<name>fs.defaultFS</name>\
<value>hdfs://'"$DOCKER_IP"':9000</value>\
</property>' core-site.xml
Adjust hdfs-site.xml to set the replication count suitable for a single-node test environment:
sed -i '/<\/configuration>/i\
<property>\
<name>dfs.replication</name>\
<value>1</value>\
</property>' hdfs-site.xml
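Running either sed command a second time inserts the same property again. A minimal sketch of a repeatable alternative is shown here. The function name add_property is hypothetical, and it assumes values contain no characters that need XML escaping (such as & or <) and no backslashes:

```shell
# Hypothetical helper: add a <property> with the given name and value to
# a Hadoop XML config file, just before </configuration>, unless a
# property with that name is already defined.
add_property() {
    file=$1
    name=$2
    value=$3
    # Skip the edit if the property name already exists in the file.
    if grep -qF "<name>$name</name>" "$file"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    awk -v n="$name" -v v="$value" '
        /<\/configuration>/ {
            print "  <property>"
            print "    <name>" n "</name>"
            print "    <value>" v "</value>"
            print "  </property>"
        }
        { print }
    ' "$file" > "$tmp" &&
        cat "$tmp" > "$file"   # overwrite in place to keep file permissions
    status=$?
    rm -f "$tmp"
    return $status
}
```

With this in place, the two edits become add_property core-site.xml fs.defaultFS "hdfs://$DOCKER_IP:9000" and add_property hdfs-site.xml dfs.replication 1, and both can be rerun safely.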
Before starting services, initialize the NameNode metadata:
hdfs namenode -format
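Formatting is a one-time step: running it again on a NameNode that already holds data discards the existing filesystem metadata. As a sketch, a guard can check for the current/VERSION file that a formatted NameNode directory contains. The helper name namenode_is_formatted is hypothetical, and the directory path in the usage note is an assumption based on Hadoop's default settings:

```shell
# Hypothetical helper: succeed if the given NameNode storage directory
# already holds formatted metadata (a current/VERSION file).
namenode_is_formatted() {
    name_dir=$1
    [ -f "$name_dir/current/VERSION" ]
}
```

Assuming neither hadoop.tmp.dir nor dfs.namenode.name.dir has been overridden, the storage directory for this account would be /tmp/hadoop-hadoop_user/dfs/name, and the format step can be guarded with namenode_is_formatted /tmp/hadoop-hadoop_user/dfs/name || hdfs namenode -format.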
Start the HDFS daemons with the script in the installation's sbin directory (it is not located in etc/hadoop, the current directory):
$HADOOP_HOME/sbin/start-dfs.sh
This command starts the NameNode, DataNode, and SecondaryNameNode daemons. Verify that they are active by listing the running JVM processes:
jps
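Reading the jps listing by eye is easy to get wrong, since NameNode is also a substring of SecondaryNameNode. A small sketch that reports which of the three daemons are absent follows. The function name missing_daemons is hypothetical, and it relies on the jps output format of one "<pid> <name>" pair per line:

```shell
# Hypothetical helper: read jps output ("<pid> <name>" per line) on stdin
# and print each expected HDFS daemon that is NOT listed.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode DataNode SecondaryNameNode; do
        # Compare the second field exactly so that "NameNode" does not
        # match the "SecondaryNameNode" line.
        printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
            echo "$daemon"
    done
}
```

Running jps | missing_daemons prints nothing when all three daemons are up, and otherwise prints one missing daemon name per line.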
Successful deployment can be confirmed by visiting the NameNode web console over HTTP (port 9870 is the Hadoop 3.x default; Hadoop 2.x uses 50070):
http://<YOUR_CONTAINER_IP>:9870/
Accessing this endpoint displays the cluster status and storage metrics directly in the browser interface.
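The web console may take a few seconds to respond after the daemons start, so a scripted check usually needs to retry. A minimal sketch of a generic retry wrapper follows. The name retry and its argument order are hypothetical, and the curl call in the usage note assumes curl is installed in the container:

```shell
# Hypothetical helper: run a command up to <attempts> times, sleeping
# <delay> seconds after each failure; succeed as soon as the command does.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        # Give up once the allowed number of attempts has been used.
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}
```

For example, retry 10 2 curl -fsS -o /dev/null "http://$DOCKER_IP:9870/" polls the console up to ten times, two seconds apart, and returns non-zero if it never answers.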