Automating Hadoop and Hive Pseudo-Distributed Deployment with Bash Scripts

Project Structure Overview

The automation solution is organized into specific directories to separate concerns:

  • lib/: Contains the Java libraries required for the setup, including dom4j for XML parsing, the MySQL JDBC driver, and the XmlUpdater.jar helper invoked by the scripts.
  • software/: Stores the binary packages for Hadoop and Hive (e.g., hadoop-2.6.0-cdh5.10.0.tar.gz).
  • scripts/: Houses the shell scripts responsible for the installation logic, environment configuration, and execution flow.

System Prerequisites

Prior to execution, ensure the target Linux environment meets the following requirements:

  • Java Development Kit (JDK) is installed.
  • MySQL database server is installed and running.
  • The system firewall is disabled or configured to allow required ports.
  • Network connectivity is established (ability to ping external hosts).
  • The hostname is properly configured in /etc/hostname.
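
Some of these requirements can be checked before the installer runs. The following is a minimal sketch, not part of the original scripts: have_command and check_prerequisites are hypothetical helpers, and the command names passed to them (java, mysql, ping) are assumptions about how the tools are exposed on PATH.

```shell
#!/bin/bash
# Preflight sketch: report which required commands are available.
# have_command and check_prerequisites are hypothetical helper names.

# Return 0 if the given command is available on PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

# Print OK/MISSING for each command; return non-zero if any is missing.
check_prerequisites() {
    local missing=0
    local cmd
    for cmd in "$@"; do
        if have_command "$cmd"; then
            echo "OK       $cmd"
        else
            echo "MISSING  $cmd"
            missing=1
        fi
    done
    return "$missing"
}
```

A call such as check_prerequisites java mysql ping || exit 1 near the top of main.sh would stop the run early if a tool is absent. It does not verify that the MySQL server is actually running or that the firewall allows the required ports.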

Preparing the Environment

Give the non-root user that will run the scripts ownership of /opt, then create a dedicated directory for the installation files. Run chown as root and replace username with the actual account name:

chown username /opt
mkdir -p /opt/hadoop-install

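Place the project into the created directory, keeping scripts/, lib/, and software/ side by side, and grant execution rights from inside scripts/. The scripts use relative paths (./env-config.sh, ../software, ../lib), so they must also be run from that directory: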

chmod +x main.sh env-config.sh functions.sh

Configuration Variables

The env-config.sh file defines the static parameters and the values derived at run time (such as the hostname used in fs.defaultFS) that the setup requires. This includes installation paths, database credentials, and XML configuration values.

#!/bin/bash

# Primary Installation Directory
BASE_INSTALL_DIR="/opt/hadoop"

# Database Connection Parameters
DB_HOST="192.168.59.100"
DB_PORT="3306"
DB_NAME="hive_metadata"
DB_USER="root"
DB_PASSWORD="password"
MYSQL_JAR="mysql-connector-java-5.1.42-bin.jar"

# Java Environment
JAVA_HOME_PATH="/opt/software/jdk1.8.0_131"

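# Internal Configuration Paths (Do not modify unless necessary)
# HADOOP_CONF_DIR is a suffix appended to the extracted Hadoop home, not the system-wide /etc/hadoop
HADOOP_CONF_DIR="/etc/hadoop"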
TEMP_DIR_HADOOP="${BASE_INSTALL_DIR}/tmp/hadoop"
TEMP_DIR_HIVE="${BASE_INSTALL_DIR}/tmp/hive"

# Environment script targets
ENV_SCRIPTS=("hadoop-env.sh" "mapred-env.sh" "yarn-env.sh")

# Hadoop XML configuration definitions
CORE_SITE_PARAMS=("core-site.xml" "fs.defaultFS" "hdfs://$(hostname):9000" "hadoop.tmp.dir" "${TEMP_DIR_HADOOP}")
HDFS_SITE_PARAMS=("hdfs-site.xml" "dfs.replication" "1")

# Hive Configuration
HIVE_LOG_DIR="${BASE_INSTALL_DIR}/logs/hive"
HIVE_SITE_PARAMS=("hive-site.xml" "javax.jdo.option.ConnectionURL" "jdbc:mysql://${DB_HOST}:${DB_PORT}/${DB_NAME}?createDatabaseIfNotExist=true" "javax.jdo.option.ConnectionDriverName" "com.mysql.jdbc.Driver" "javax.jdo.option.ConnectionUserName" "${DB_USER}" "javax.jdo.option.ConnectionPassword" "${DB_PASSWORD}")
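
Each *_PARAMS array follows the same layout: the first element is the target file name, and the remaining elements are alternating property names and values. A minimal sketch of a dry-run helper that prints those pairs and rejects a key without a value is shown here; print_config_pairs is a hypothetical name and is not part of the original scripts.

```shell
#!/bin/bash
# Dry-run sketch: list the key/value pairs encoded in a *_PARAMS array.
# print_config_pairs is a hypothetical helper, not part of the original scripts.
print_config_pairs() {
    local file_name=$1
    shift
    # An odd number of remaining arguments means a key has no value.
    if [ $(( $# % 2 )) -ne 0 ]; then
        echo "Unpaired key in settings for ${file_name}" >&2
        return 1
    fi
    # Consume the arguments two at a time: property name, then value.
    while [ $# -gt 0 ]; do
        echo "${file_name}: $1 = $2"
        shift 2
    done
}
```

After sourcing env-config.sh, a call such as print_config_pairs "${HDFS_SITE_PARAMS[@]}" would print one line per property, which makes it easy to review the settings before any file is modified.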

Core Function Library

The functions.sh script contains the logic for directory preparation, file extraction, and configuration modification.

#!/bin/bash
source ./env-config.sh

# Directory preparation and cleanup
prepare_directory() {
    if [ -d "$1" ]; then
        echo "Directory $1 exists. Cleaning contents..."
        rm -rf "${1:?}"/*
    else
        mkdir -p "$1"
    fi
}

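# Extract the first archive in ../software whose name starts with pkg_name
extract_package() {
    local pkg_name=$1
    local target_dir=$2
    local archive
    archive=$(find ../software -name "${pkg_name}*" | head -n 1)
    # Stop early if no matching archive was found
    [ -n "$archive" ] || { echo "No ${pkg_name} archive found in ../software." >&2; exit 1; }
    if tar -xzf "$archive" -C "$target_dir"; then
        echo "$pkg_name extracted successfully."
    else
        echo "Failed to extract $archive." >&2
        exit 1
    fi
}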

# Modify Hadoop environment scripts (non-XML)
configure_env_scripts() {
    local install_dir=$1
    local hadoop_home_dir=$(ls "$install_dir" | grep hadoop)
    local conf_path="${install_dir}/${hadoop_home_dir}${HADOOP_CONF_DIR}"

    for script in "${ENV_SCRIPTS[@]}"; do
        sed -i '/export JAVA_HOME/d' "${conf_path}/${script}"
        sed -i "2a export JAVA_HOME=${JAVA_HOME_PATH}" "${conf_path}/${script}"
    done
    
    # Configure PID directory
    sed -i "s|export HADOOP_PID_DIR=.*|export HADOOP_PID_DIR=${TEMP_DIR_HADOOP}/pid|g" "${conf_path}/hadoop-env.sh"
}

# Update XML configuration files using Java helper
update_xml_config() {
    local config_array=("$@")
    local file_name="${config_array[0]}"
    local hadoop_home_dir=$(ls "${BASE_INSTALL_DIR}" | grep hadoop)
    local file_path="${BASE_INSTALL_DIR}/${hadoop_home_dir}${HADOOP_CONF_DIR}/${file_name}"

    local i=1
    while [ $i -lt ${#config_array[@]} ]; do
        local key="${config_array[$i]}"
        local val="${config_array[$((i+1))]}"
        java -jar ../lib/XmlUpdater.jar "$file_path" add "$key" "$val"
        ((i+=2))
    done
}

# Format the NameNode
format_namenode() {
    local hadoop_home=$(ls "${BASE_INSTALL_DIR}" | grep hadoop)
    "${BASE_INSTALL_DIR}/${hadoop_home}/bin/hdfs" namenode -format
}
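
Several of the functions above locate the extracted directory with ls piped into grep, which returns every matching name and therefore yields a multi-line value if more than one entry matches. A minimal sketch of a stricter lookup follows, assuming the extracted directory name starts with the package name (as in hadoop-2.6.0-cdh5.10.0); find_home_dir is a hypothetical helper and is not part of the original scripts.

```shell
#!/bin/bash
# Sketch: print the name of the first directory under base whose name starts with prefix.
# find_home_dir is a hypothetical helper, not part of the original scripts.
find_home_dir() {
    local base=$1
    local prefix=$2
    local dir
    for dir in "$base"/"$prefix"*/; do
        # If the glob matched nothing it stays literal, so skip non-directories.
        [ -d "$dir" ] || continue
        basename "$dir"
        return 0
    done
    echo "No directory starting with ${prefix} under ${base}" >&2
    return 1
}
```

With this helper, a line such as hadoop_home=$(find_home_dir "${BASE_INSTALL_DIR}" hadoop) || exit 1 would fail loudly when extraction did not produce the expected directory, instead of building a path from an empty name.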

Main Execution Script

The main.sh script orchestrates the installation sequence by sourcing the environment and function files, then calling the functions in order.

#!/bin/bash
source ./env-config.sh
source ./functions.sh

# Setup directories
prepare_directory "${BASE_INSTALL_DIR}"
mkdir -p "${TEMP_DIR_HADOOP}"

# Install Packages
extract_package hadoop "${BASE_INSTALL_DIR}"
extract_package hive "${BASE_INSTALL_DIR}"

# Configure Hadoop
configure_env_scripts "${BASE_INSTALL_DIR}"
update_xml_config "${CORE_SITE_PARAMS[@]}"
update_xml_config "${HDFS_SITE_PARAMS[@]}"

# Initialize Filesystem
format_namenode

# Configure Hive (simplified example)
# Locate the extracted Hive directory and copy the JDBC driver into its lib folder
hive_home=$(ls "${BASE_INSTALL_DIR}" | grep hive)
mkdir -p "${HIVE_LOG_DIR}"
cp "../lib/${MYSQL_JAR}" "${BASE_INSTALL_DIR}/${hive_home}/lib/"

echo "Pseudo-distributed installation completed."
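
Because the scripts refer to ./env-config.sh, ../software, and ../lib, they only work when the current directory is the one that holds main.sh. A minimal sketch of a helper that resolves a script's own directory is shown here; script_dir_of is a hypothetical name and is not part of the original scripts.

```shell
#!/bin/bash
# Sketch: print the absolute directory that contains the given script path.
# script_dir_of is a hypothetical helper, not part of the original scripts.
script_dir_of() {
    # The subshell keeps the caller's working directory unchanged.
    (cd "$(dirname "$1")" && pwd)
}
```

Placing cd "$(script_dir_of "${BASH_SOURCE[0]}")" || exit 1 before the source lines in main.sh would make the relative paths resolve the same way regardless of where the script is launched from.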

Java XML Configuration Utility

The shell scripts rely on a Java utility to manipulate XML configuration files. The following Java code uses dom4j to inject properties into Hadoop and Hive configuration files.

package com.deploy.utils;

import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;
import org.dom4j.io.XMLWriter;
import org.dom4j.io.OutputFormat;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class XmlConfigUpdater {

    public static void main(String[] args) {
        if (args.length < 4) {
            System.err.println("Usage: java -jar XmlUpdater.jar <file> <action> <key> <value>");
            System.exit(1);
        }

        String filePath = args[0];
        // args[1] is the action; only "add" is implemented, so it is not inspected here
        String key = args[2];
        String value = args[3];

        try {
            manipulateXmlProperty(filePath, key, value);
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }

    private static void manipulateXmlProperty(String filePath, String key, String value) throws DocumentException, IOException {
        SAXReader reader = new SAXReader();
        Document document = reader.read(new File(filePath));
        Element root = document.getRootElement();

        // Append a new <property> element with <name> and <value> children to the root
        Element property = root.addElement("property");
        property.addElement("name").setText(key);
        property.addElement("value").setText(value);

        // Write back to file with pretty print format
        OutputFormat format = OutputFormat.createPrettyPrint();
        format.setEncoding("UTF-8");

        // Write through an OutputStream so the bytes on disk match the declared UTF-8 encoding
        try (OutputStream out = new FileOutputStream(filePath)) {
            XMLWriter writer = new XMLWriter(out, format);
            writer.write(document);
            writer.flush();
        }
    }
}
}
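
The utility appends a new property element on every call, so running the installer twice against the same file would leave duplicate entries. A minimal sketch of a guard on the shell side follows. has_property is a hypothetical helper, and it assumes each property name sits on a single line as <name>key</name>, which is how the pretty-printed output of the utility lays out text-only elements.

```shell
#!/bin/bash
# Sketch: return 0 if the configuration file already defines the given property name.
# has_property is a hypothetical helper, not part of the original scripts.
# Assumption: the name appears on one line as <name>key</name>.
has_property() {
    local file=$1
    local key=$2
    # -F treats the pattern as a fixed string, so dots in the key are literal.
    grep -qF "<name>${key}</name>" "$file"
}
```

Inside update_xml_config, the java call could then be written as has_property "$file_path" "$key" || java -jar ../lib/XmlUpdater.jar "$file_path" add "$key" "$val", which skips keys that are already present. It does not update the value of an existing property.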

Tags: Hadoop Hive bash automation Linux

Posted on Mon, 18 May 2026 03:08:58 +0000 by galayman