Apache Hadoop Deployment Strategies and HDFS Initialization on Docker

  • Local Mode (Standalone): In this configuration, Hadoop functions purely as a library. It executes MapReduce jobs on a single machine without managing background processes. This mode is intended solely for debugging code and rapid prototyping.
  • Pseudo-Distributed Mode: Here, all Hadoop daemons run as separate background processes on a single host. This setup simulates a cluster environment on one machine, making it ideal for educational purposes and integration testing.
  • Distributed Mode: This represents the production-grade architecture where multiple nodes communicate over a network to provide high-availability data processing services.

HDFS Architecture and Lifecycle Management

Similar to database systems, HDFS operates through long-running daemon processes. Interaction occurs via a client application connecting to the server over a socket-based network interface.

Assuming a base environment is already prepared within a Docker container, follow these steps to establish a functional Hadoop unit. If the previous container session was terminated, restart the base instance labeled base_hadoop before proceeding.

Account Provisioning and Privilege Escalation

Create a dedicated system account to manage Hadooop processes securely rather than running as root:

useradd -m hadoop_user

Install necessary utilities for credential management and privilege handling:

yum install -y sudo passwd

Assign a secure password to the newly created account:

passwd hadoop_user

Ensure the software installation directory belongs to this new user:

chown -R hadoop_user /opt/hadoop

Udpate the /etc/sudoers file to grant the user passwordless sudo access where required. Append the following entry below the root definition:

hadoop_user ALL=(ALL)       ALL

Once configured, persist the current state of the base container into a reusable image:

docker commit base_hadoop image_hadoop_deploy

Instance Creation and SSH Trust

Launch a fresh container from the committed image specifically for HDFS operations:

docker run -d --name hdfs_instance -t image_hadoop_deploy /sbin/init

Attach to the container and switch to the restricted user:

docker exec -it hdfs_instance su hadoop_user
whoami

To enable seamless internal communication, generate an SSH key pair:

ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa

Push the public key to the authorized identity list. First, retrieve the container's internal IP address dynamically:

DOCKER_IP=$(ip addr show docker0 | grep inet | awk '{print $2}' | cut -d'/' -f1 | head -n1)
echo $DOCKER_IP

Add the generated key to the host known_hosts file based on the detected IP:

ssh-copy-id hadoop_user@$DOCKER_IP

Core Configuration and Service Launch

Navigate to the configuration directory relative to the installation path:

cd $HADOOP_HOME/etc/hadoop

Modify core-site.xml to define the default filesystem namespace:

sed -i '/<\/configuration>/i\
<property>\
<name>fs.defaultFS</name>\
<value>hdfs://'"$DOCKER_IP"':9000</value>\
</property>' core-site.xml

Adjust hdfs-site.xml to set the replication count suitable for a single-node test environment:

sed -i '/<\/configuration>/i\
<property>\
<name>dfs.replication</name>\
<value>1</value>\
</property>' hdfs-site.xml

Before starting services, initialize the NameNode metadata:

hdfs namenode -format

Initiate the HDFS daemons using the provided shell script:

./start-dfs.sh

This command triggers the startup sequence for the NameNode, DataNode, and SecondaryNameNode components. Verify that the processes are active by listing Java runtime threads:

jps

Successful deployment can be confirmed by visiting the NameNode web console via HTTP:

http://<YOUR_CONTAINER_IP>:9870/

Accessing this endpoint displays the cluster status and storage metrics directly in the browser interface.

Tags: Apache Hadoop Docker Containerization HDFS SSH Authentication Linux Administration

Posted on Mon, 18 May 2026 22:40:05 +0000 by The Cat