Apache Hadoop Deployment Strategies and HDFS Initialization on Docker

Hadoop supports three deployment modes:

Local Mode (Standalone): In this configuration, Hadoop functions purely as a library. It executes MapReduce jobs on a single machine against the local filesystem, without starting any daemons. This mode is intended for debugging code and rapid prototyping.
Pseudo-Distributed Mode: Here, all Hadoop daemons run as separate background processes on a single host. This setup simulates a cluster on one machine, making it well suited to learning and integration testing.
Distributed Mode: This is the production-grade architecture, in which the daemons are spread across multiple nodes that communicate over a network to provide scalable, fault-tolerant storage and processing.
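One quick way to tell which mode an installation is configured for is to look at fs.defaultFS, which defaults to file:/// in local mode and is set to an hdfs:// URI once HDFS is in use. The helper below is only a sketch; the function name hadoop_mode_for_fs is hypothetical, and it cannot distinguish pseudo-distributed from fully distributed setups because both use hdfs://.

```shell
# Classify a fs.defaultFS value (hypothetical helper, not part of Hadoop).
hadoop_mode_for_fs() {
  case "$1" in
    file:*)   echo "local (standalone)" ;;
    hdfs://*) echo "pseudo-distributed or fully distributed" ;;
    *)        echo "unknown" ;;
  esac
}

# Usage on a machine with Hadoop on the PATH:
#   hadoop_mode_for_fs "$(hdfs getconf -confKey fs.defaultFS)"
```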

HDFS Architecture and Lifecycle Management

Similar to database systems, HDFS operates through long-running daemon processes. Interaction occurs via a client application connecting to the server over a socket-based network interface.

Assuming a base environment is already prepared within a Docker container, follow these steps to establish a functional Hadoop unit. If the previous container session was terminated, restart the base instance labeled base_hadoop before proceeding.

Account Provisioning and Privilege Escalation

Create a dedicated system account to manage Hadoop processes securely rather than running them as root:

useradd -m hadoop_user

Install necessary utilities for credential management and privilege handling:

yum install -y sudo passwd

Assign a secure password to the newly created account:

passwd hadoop_user

Ensure the software installation directory belongs to this new user:

chown -R hadoop_user /opt/hadoop

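Update the /etc/sudoers file (preferably with visudo, which validates the syntax) to grant the user sudo access. Append the following entry below the root definition. As written, it still prompts for the user's password; replace the final ALL with NOPASSWD: ALL if passwordless sudo is required: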

hadoop_user ALL=(ALL)       ALL
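As an alternative to editing /etc/sudoers directly, the rule can live in a drop-in file that is syntax-checked before it is installed. This is a minimal sketch, assuming the image's sudoers file includes /etc/sudoers.d (the usual default on CentOS-style images); the function name write_sudoers_dropin and the drop-in file name are hypothetical.

```shell
# Write a passwordless sudo rule for one user to the given file.
# Hypothetical helper; the target path is chosen by the caller.
write_sudoers_dropin() {
  local user="$1" target="$2"
  printf '%s ALL=(ALL) NOPASSWD: ALL\n' "$user" > "$target"
  chmod 0440 "$target"   # sudo rejects drop-ins that are writable by others
}

# Usage as root: build the file, validate it, then install it.
#   tmp=$(mktemp)
#   write_sudoers_dropin hadoop_user "$tmp"
#   visudo -cf "$tmp" && install -m 0440 "$tmp" /etc/sudoers.d/hadoop_user
```

Validating with visudo -cf first avoids locking yourself out of sudo with a malformed rule.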

Once configured, persist the current state of the base container into a reusable image:

docker commit base_hadoop image_hadoop_deploy

Instance Creation and SSH Trust

Launch a fresh container from the committed image specifically for HDFS operations:

docker run -d --name hdfs_instance -t image_hadoop_deploy /sbin/init

Attach to the container and switch to the restricted user:

docker exec -it hdfs_instance su hadoop_user
whoami

To enable seamless internal communication, generate an SSH key pair:

ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa

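Next, register the public key as an authorized key for the same account so that the Hadoop scripts can SSH into this node without a password. First, detect the container's internal IP address. Inside a container the primary interface is normally eth0; docker0 is the bridge on the Docker host and is not visible here:

DOCKER_IP=$(ip -4 addr show eth0 | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1 | head -n1)
echo "$DOCKER_IP"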

Append the generated public key to the account's authorized_keys file at the detected IP:

ssh-copy-id hadoop_user@$DOCKER_IP
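Because the key is only being authorized for the same account on the same machine, the same result can be reached without a network round trip. The sketch below appends a public key to authorized_keys only if it is not already there and sets the permissions sshd expects; the function name authorize_key is hypothetical.

```shell
# Append a public key to an authorized_keys file exactly once.
# Hypothetical helper; both paths are supplied by the caller.
authorize_key() {
  local pubkey_file="$1" auth_file="$2"
  mkdir -p "$(dirname "$auth_file")"
  chmod 700 "$(dirname "$auth_file")"
  touch "$auth_file"
  # -x: whole-line match, -F: treat the key as a literal string
  if ! grep -qxF "$(cat "$pubkey_file")" "$auth_file"; then
    cat "$pubkey_file" >> "$auth_file"
  fi
  chmod 600 "$auth_file"
}

# Usage as hadoop_user (-N '' creates the key without a passphrase):
#   ssh-keygen -t rsa -b 4096 -N '' -f ~/.ssh/id_rsa
#   authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```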

Core Configuration and Service Launch

Navigate to the configuration directory relative to the installation path:

cd $HADOOP_HOME/etc/hadoop

Modify core-site.xml to define the default filesystem namespace:

sed -i '/<\/configuration>/i\
<property>\
<name>fs.defaultFS</name>\
<value>hdfs://'"$DOCKER_IP"':9000</value>\
</property>' core-site.xml

Adjust hdfs-site.xml to set the replication count suitable for a single-node test environment:

sed -i '/<\/configuration>/i\
<property>\
<name>dfs.replication</name>\
<value>1</value>\
</property>' hdfs-site.xml
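Note that the sed commands above insert a new property block every time they run, so repeating them leaves duplicates. One possible guard is sketched below: it skips the insert when the property name is already present. The function name add_property is hypothetical, and the sketch assumes the closing configuration tag sits on its own line, as it does in the stock files.

```shell
# Insert a <property> block before </configuration> unless the name exists.
# Hypothetical helper; values are written as-is, without XML escaping.
add_property() {
  local file="$1" name="$2" value="$3"
  if grep -qF "<name>${name}</name>" "$file"; then
    return 0   # already configured, leave the file untouched
  fi
  awk -v n="$name" -v v="$value" '
    /<\/configuration>/ {
      print "  <property>"
      print "    <name>" n "</name>"
      print "    <value>" v "</value>"
      print "  </property>"
    }
    { print }
  ' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Usage from $HADOOP_HOME/etc/hadoop:
#   add_property core-site.xml fs.defaultFS "hdfs://$DOCKER_IP:9000"
#   add_property hdfs-site.xml dfs.replication 1
```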

Before starting services, initialize the NameNode metadata:

hdfs namenode -format
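Formatting is a one-time step: running it again on an existing NameNode directory discards the filesystem metadata. A formatted directory contains a current/VERSION file, so a small check like the following can guard the command. This is a sketch; the function name namenode_is_formatted is hypothetical, and the usage lines assume hdfs getconf returns the directory as a single path or file:// URI.

```shell
# Succeed if the given NameNode directory has already been formatted.
# Hypothetical helper; accepts a plain path or a file:// URI.
namenode_is_formatted() {
  local name_dir="${1#file://}"   # strip an optional file:// prefix
  [ -f "$name_dir/current/VERSION" ]
}

# Usage:
#   name_dir=$(hdfs getconf -confKey dfs.namenode.name.dir)
#   namenode_is_formatted "$name_dir" || hdfs namenode -format
```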

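Start the HDFS daemons using the provided shell script, which lives in the sbin directory of the installation rather than in the configuration directory:

$HADOOP_HOME/sbin/start-dfs.sh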

This command triggers the startup sequence for the NameNode, DataNode, and SecondaryNameNode components. Verify that the processes are active by listing the running JVM processes:

jps
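The jps output prints one "PID ClassName" pair per line, so the check can be automated. The sketch below reports which of the three expected daemons are absent from a captured jps listing; the function name missing_daemons is hypothetical.

```shell
# Print the expected HDFS daemons that do not appear in a jps listing.
# Hypothetical helper; pass the jps output as a single argument.
missing_daemons() {
  local jps_output="$1" daemon missing=""
  for daemon in NameNode DataNode SecondaryNameNode; do
    # Compare the class-name column exactly, so NameNode does not
    # accidentally match SecondaryNameNode.
    if ! printf '%s\n' "$jps_output" | awk '{print $2}' | grep -qx "$daemon"; then
      missing="$missing $daemon"
    fi
  done
  printf '%s\n' "${missing# }"
}

# Usage:
#   missing=$(missing_daemons "$(jps)")
#   [ -z "$missing" ] && echo "HDFS is up" || echo "Not running: $missing"
```

If a daemon is missing, its log file under $HADOOP_HOME/logs is usually the first place to look.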

Successful deployment can be confirmed by opening the NameNode web console over HTTP (9870 is the default port in Hadoop 3.x; Hadoop 2.x uses 50070):

http://<YOUR_CONTAINER_IP>:9870/

This page shows the cluster status and storage metrics in the browser.

Tags: Apache Hadoop Docker Containerization HDFS SSH Authentication Linux Administration

Posted on Mon, 18 May 2026 22:40:05 +0000 by The Cat
