Automating Hadoop and Hive Pseudo-Distributed Deployment with Bash Scripts

Project Structure Overview

The project is split into three directories, each with a single responsibility:

lib/: Contains the Java libraries required for the setup, including dom4j for XML parsing, the MySQL JDBC driver, and the XmlUpdater.jar helper that the scripts invoke.
software/: Stores the binary packages for Hadoop and Hive (e.g., hadoop-2.6.0-cdh5.10.0.tar.gz).
scripts/: Houses the shell scripts responsible for the installation logic, environment configuration, and execution flow.

System Prerequisites

Prior to execution, ensure the target Linux environment meets the following requirements:

Java Development Kit (JDK) is installed.
MySQL database server is installed and running.
The system firewall is disabled or configured to allow required ports.
Network connectivity is established (ability to ping external hosts).
The hostname is properly configured in /etc/hostname.
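
These prerequisites can be checked automatically before the installer runs. The sketch below is one possible way to do that and is not part of the original scripts: the function names require_command and preflight are hypothetical, the PING_TARGET variable and its default address are assumptions, and the check assumes the mysql client is on the PATH of the machine where the server is installed.

```shell
#!/bin/bash
# Hypothetical preflight helpers; names are illustrative, not part of the original scripts.

# Return 0 if the given command is on the PATH, otherwise print a message and return 1.
require_command() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "Missing prerequisite: $1" >&2
    return 1
}

# Run all checks; the return code is the number of failed checks (0 means ready).
preflight() {
    local failures=0
    require_command java  || failures=$((failures + 1))
    require_command mysql || failures=$((failures + 1))
    # The hostname must be non-empty because fs.defaultFS is built from it.
    if [ -z "$(hostname)" ]; then
        echo "Hostname is not set" >&2
        failures=$((failures + 1))
    fi
    # One ICMP packet confirms basic connectivity; the default target is an assumption.
    if ! ping -c 1 -W 2 "${PING_TARGET:-8.8.8.8}" >/dev/null 2>&1; then
        echo "No network connectivity" >&2
        failures=$((failures + 1))
    fi
    return "$failures"
}
```

Calling preflight || exit 1 at the top of main.sh would stop the installation before any directory is touched.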

Preparing the Environment

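Create a dedicated directory for the installation files and give the non-root user ownership of it and of /opt, so that the installer can create directories there. Run these commands as root, replacing username with the account that will run the installer:

mkdir -p /opt/hadoop-install
chown username /opt /opt/hadoop-install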

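Copy the lib/, software/, and scripts/ directories into /opt/hadoop-install, then change into scripts/ and make the scripts executable. The scripts refer to ../lib and ../software with relative paths, so they must be run from inside scripts/:

cd /opt/hadoop-install/scripts
chmod +x main.sh env-config.sh functions.sh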

Configuration Variables

The env-config.sh file defines static parameters and dynamic inputs required for the setup. This includes installation paths, database credentials, and XML configuration values.

#!/bin/bash

# Primary Installation Directory
BASE_INSTALL_DIR="/opt/hadoop"

# Database Connection Parameters
DB_HOST="192.168.59.100"
DB_PORT="3306"
DB_NAME="hive_metadata"
DB_USER="root"
DB_PASSWORD="password"
MYSQL_JAR="mysql-connector-java-5.1.42-bin.jar"

# Java Environment
JAVA_HOME_PATH="/opt/software/jdk1.8.0_131"

# Internal Configuration Paths (Do not modify unless necessary)
# HADOOP_CONF_DIR is appended to the Hadoop home directory; it is not the system-wide /etc/hadoop.
HADOOP_CONF_DIR="/etc/hadoop"
TEMP_DIR_HADOOP="${BASE_INSTALL_DIR}/tmp/hadoop"
TEMP_DIR_HIVE="${BASE_INSTALL_DIR}/tmp/hive"

# Environment script targets
ENV_SCRIPTS=("hadoop-env.sh" "mapred-env.sh" "yarn-env.sh")

# Hadoop XML configuration definitions
CORE_SITE_PARAMS=("core-site.xml" "fs.defaultFS" "hdfs://$(hostname):9000" "hadoop.tmp.dir" "${TEMP_DIR_HADOOP}")
HDFS_SITE_PARAMS=("hdfs-site.xml" "dfs.replication" "1")

# Hive Configuration
HIVE_LOG_DIR="${BASE_INSTALL_DIR}/logs/hive"
HIVE_SITE_PARAMS=("hive-site.xml" "javax.jdo.option.ConnectionURL" "jdbc:mysql://${DB_HOST}:${DB_PORT}/${DB_NAME}?createDatabaseIfNotExist=true" "javax.jdo.option.ConnectionDriverName" "com.mysql.jdbc.Driver" "javax.jdo.option.ConnectionUserName" "${DB_USER}" "javax.jdo.option.ConnectionPassword" "${DB_PASSWORD}")
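
Each *_PARAMS array stores the target file name first, followed by alternating property names and values. A quick shape check can catch a missing value before the installer starts writing files. This is a minimal sketch; validate_params is a hypothetical helper and not part of the original scripts.

```shell
#!/bin/bash
# Hypothetical helper: verify that a parameter list has the layout
# <file name> <key 1> <value 1> [<key 2> <value 2> ...].
validate_params() {
    local count=$#
    # One file name plus at least one key/value pair gives an odd count of 3 or more.
    if [ "$count" -lt 3 ] || [ $((count % 2)) -eq 0 ]; then
        echo "Invalid parameter list for '${1:-<empty>}': expected a file name followed by key/value pairs" >&2
        return 1
    fi
    return 0
}
```

For example, validate_params "${CORE_SITE_PARAMS[@]}" || exit 1 could be placed at the end of env-config.sh.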

Core Function Library

The functions.sh script contains the logic for directory preparation, file extraction, and configuration modification.

#!/bin/bash
source ./env-config.sh

# Directory preparation and cleanup
prepare_directory() {
    if [ -d "$1" ]; then
        echo "Directory $1 exists. Cleaning contents..."
        rm -rf "${1:?}"/*
    else
        mkdir -p "$1"
    fi
}

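# Extract the first archive in ../software whose name starts with the given prefix
extract_package() {
    local pkg_name=$1
    local target_dir=$2
    local archive
    archive=$(find ../software -name "${pkg_name}*" | head -n 1)
    # Stop early if no matching archive exists instead of passing an empty path to tar
    [ -n "$archive" ] || { echo "No archive matching ${pkg_name}* in ../software." >&2; exit 1; }
    if tar -xzf "$archive" -C "$target_dir"; then
        echo "$pkg_name extracted successfully."
    else
        echo "Failed to extract $archive." >&2
        exit 1
    fi
}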

# Modify Hadoop environment scripts (non-XML)
configure_env_scripts() {
    local install_dir=$1
    local hadoop_home_dir=$(ls "$install_dir" | grep hadoop)
    local conf_path="${install_dir}/${hadoop_home_dir}${HADOOP_CONF_DIR}"

    for script in "${ENV_SCRIPTS[@]}"; do
        sed -i '/export JAVA_HOME/d' "${conf_path}/${script}"
        sed -i "2a export JAVA_HOME=${JAVA_HOME_PATH}" "${conf_path}/${script}"
    done
    
    # Configure PID directory
    sed -i "s|export HADOOP_PID_DIR=.*|export HADOOP_PID_DIR=${TEMP_DIR_HADOOP}/pid|g" "${conf_path}/hadoop-env.sh"
}

# Update XML configuration files using Java helper
update_xml_config() {
    local config_array=("$@")
    local file_name="${config_array[0]}"
    local hadoop_home_dir=$(ls "${BASE_INSTALL_DIR}" | grep hadoop)
    local file_path="${BASE_INSTALL_DIR}/${hadoop_home_dir}${HADOOP_CONF_DIR}/${file_name}"

    local i=1
    while [ $i -lt ${#config_array[@]} ]; do
        local key="${config_array[$i]}"
        local val="${config_array[$((i+1))]}"
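        # Abort the installation if the XML helper fails, so later steps do not run on a broken config
        java -jar ../lib/XmlUpdater.jar "$file_path" add "$key" "$val" || exit 1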
        ((i+=2))
    done
}

# Format the NameNode
format_namenode() {
    local hadoop_home=$(ls "${BASE_INSTALL_DIR}" | grep hadoop)
    "${BASE_INSTALL_DIR}/${hadoop_home}/bin/hdfs" namenode -format
}
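
The functions above locate the unpacked Hadoop directory with ls piped to grep, which matches the word anywhere in a name and returns several lines if more than one entry matches. One possible alternative is a glob-based lookup. The sketch below assumes the unpacked directory name starts with the package name (as in hadoop-2.6.0-cdh5.10.0); find_home_dir is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: print the name of the first directory under $1 whose
# name starts with the prefix $2; return 1 if there is none.
find_home_dir() {
    local parent=$1
    local prefix=$2
    local candidate
    for candidate in "$parent"/"$prefix"*/; do
        # When nothing matches, the unexpanded pattern is not a directory and is skipped.
        if [ -d "$candidate" ]; then
            basename "$candidate"
            return 0
        fi
    done
    echo "No directory starting with '$prefix' found in $parent" >&2
    return 1
}
```

A caller could then write hadoop_home=$(find_home_dir "${BASE_INSTALL_DIR}" hadoop) || exit 1 and fail with a clear message when extraction did not produce the expected directory.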

Main Execution Script

The main.sh script orchestrates the installation. It sources the environment and function files, then runs each step in order.

#!/bin/bash
source ./env-config.sh
source ./functions.sh

# Setup directories
prepare_directory "${BASE_INSTALL_DIR}"
mkdir -p "${TEMP_DIR_HADOOP}"

# Install Packages
extract_package hadoop "${BASE_INSTALL_DIR}"
extract_package hive "${BASE_INSTALL_DIR}"

# Configure Hadoop
configure_env_scripts "${BASE_INSTALL_DIR}"
update_xml_config "${CORE_SITE_PARAMS[@]}"
update_xml_config "${HDFS_SITE_PARAMS[@]}"

# Initialize Filesystem
format_namenode

# Configure Hive (simplified example)
hive_home=$(ls "${BASE_INSTALL_DIR}" | grep hive)
mkdir -p "${HIVE_LOG_DIR}"
cp "../lib/${MYSQL_JAR}" "${BASE_INSTALL_DIR}/${hive_home}/lib/"

echo "Pseudo-distributed installation completed."
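
The Hive step above only copies the JDBC driver; HIVE_SITE_PARAMS is defined in env-config.sh but never applied. Because the XML utility reads an existing file before adding properties, hive-site.xml has to exist first. The following sketch creates an empty configuration file when one is missing. It assumes Hive keeps its configuration in a conf/ directory under the Hive home, and ensure_site_file is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: create an empty Hadoop-style configuration file
# (a bare <configuration> root) unless the file already exists.
ensure_site_file() {
    local file_path=$1
    # Never overwrite an existing configuration.
    if [ -f "$file_path" ]; then
        return 0
    fi
    mkdir -p "$(dirname "$file_path")" || return 1
    cat > "$file_path" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
</configuration>
EOF
}
```

In main.sh this might be called as ensure_site_file "${BASE_INSTALL_DIR}/${hive_home}/conf/hive-site.xml" before each key/value pair from HIVE_SITE_PARAMS is passed to the XML utility with that same path.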

Java XML Configuration Utility

The shell scripts rely on a Java utility, invoked as lib/XmlUpdater.jar, to manipulate XML configuration files. The following Java code uses dom4j to append property elements to Hadoop and Hive configuration files.

package com.deploy.utils;

import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;
import org.dom4j.io.XMLWriter;
import org.dom4j.io.OutputFormat;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class XmlConfigUpdater {

    public static void main(String[] args) {
        if (args.length < 4) {
            System.err.println("Usage: java -jar XmlUpdater.jar <file> <action> <key> <value>");
            System.exit(1);
        }

        String filePath = args[0];
        // args[1] is the action; only "add" is implemented, so it is not inspected here.
        String key = args[2];
        String value = args[3];

        try {
            manipulateXmlProperty(filePath, key, value);
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1);
        }
    }

    // Appends a <property> element with <name> and <value> children to the document root.
    private static void manipulateXmlProperty(String filePath, String key, String value) throws DocumentException, IOException {
        SAXReader reader = new SAXReader();
        Document document = reader.read(new File(filePath));
        Element root = document.getRootElement();

        // Create property element structure
        Element property = root.addElement("property");
        property.addElement("name").setText(key);
        property.addElement("value").setText(value);

        // Write back to file with pretty print format
        OutputFormat format = OutputFormat.createPrettyPrint();
        format.setEncoding("UTF-8");

        // Writing through an OutputStream lets XMLWriter encode the bytes as UTF-8,
        // matching the encoding declared in the XML header.
        try (OutputStream out = new FileOutputStream(filePath)) {
            XMLWriter writer = new XMLWriter(out, format);
            writer.write(document);
            writer.flush();
        }
    }
}
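
Because the utility always appends a new property element, running the installer twice against the same file produces duplicate entries. A simple guard on the shell side is to skip keys that are already present. This sketch assumes the file is pretty-printed with each name element on a single line, as the utility writes it; has_property is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: return 0 if the configuration file already contains
# a <name> element with exactly the given key, non-zero otherwise.
has_property() {
    local file_path=$1
    local key=$2
    # -F treats the key as a literal string, so dots in property names are not wildcards.
    grep -qF "<name>${key}</name>" "$file_path"
}
```

Inside update_xml_config, the call to the utility could then be written as has_property "$file_path" "$key" || java -jar ../lib/XmlUpdater.jar "$file_path" add "$key" "$val".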

Tags: Hadoop Hive bash automation Linux

Posted on Mon, 18 May 2026 03:08:58 +0000 by galayman
