Configuring a Standalone Apache Hadoop and Hive Data Processing Stack on Linux

System Prerequisites and Network Configuration

Before deploying the big data stack, establish a stable base environment. The following specifications apply to this deployment:

  • OS: CentOS 7 (x86_64)
  • Resources: 2 vCPUs, 4GB RAM, 50GB Disk
  • Software Versions: OpenJDK 11, Hadoop 3.3.6, Apache Hive 3.1.3, MySQL 8.0 Community Server

Host Identity and Resolution

Assign a consistent identifier to the machine. This simplifies cluster networking scripts.

sudo hostnamectl set-hostname analytics-cluster-node
sudo systemctl restart systemd-hostnamed

Register the hostname in the local resolver table to prevent DNS lookup delays during daemon startup.

sudo bash -c 'echo "127.0.0.1 analytics-cluster-node" >> /etc/hosts'
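A quick check that the name now resolves locally:

getent hosts analytics-cluster-node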

Security and Time Synchronization

Disable packet filtering so the Hadoop daemons and external management tools can reach each other's ports without per-port firewall rules.

sudo systemctl stop firewalld.service
sudo systemctl disable firewalld.service

Verify clock accuracy across nodes. Discrepancies can cause HDFS replication issues.

timedatectl status
sudo chronyc tracking
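If tracking reports that it cannot talk to the daemon, start the service first (chrony is the stock time service on CentOS 7):

sudo systemctl enable --now chronyd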

Apache Hadoop Deployment

Environment Variables

Define runtime paths in the global profile to avoid hardcoding in scripts.

sudo nano /etc/profile.d/hadoop_env.sh

Append the following definitions:

export HADOOP_INSTALL=/opt/modules/hadoop-3.3
export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop

Apply changes: source /etc/profile.d/hadoop_env.sh
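Confirm the paths resolve before continuing:

hadoop version
echo $HADOOP_CONF_DIR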

Core Distributed Filesystem Settings

Configure basic filesystem behavior and RPC endpoints.

<!-- $HADOOP_INSTALL/etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://analytics-cluster-node:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/data/hadoop/tmp</value>
  </property>
</configuration>

Storage Node Topology

Define metadata storage locations and replication factors.

<!-- $HADOOP_INSTALL/etc/hadoop/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/var/data/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/var/data/hadoop/hdfs/datanode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
</configuration>

Compute Framework Routing

Route MapReduce jobs to the YARN resource manager and expose the framework's jars to the containers that run them.

<!-- $HADOOP_INSTALL/etc/hadoop/mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>
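YARN additionally needs the MapReduce shuffle handler registered as a NodeManager auxiliary service, which the steps above do not cover. A minimal yarn-site.xml for this single-node layout, mirroring the upstream single-node guide (the hostname is the one set earlier):

<!-- $HADOOP_INSTALL/etc/hadoop/yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>analytics-cluster-node</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_MAPRED_HOME,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ</value>
  </property>
</configuration>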

Initialize the distributed namespace and launch background services.
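The start-dfs.sh and start-yarn.sh helpers launch each daemon over ssh, even on a single node, so passwordless login to localhost must work first. A minimal setup if it does not, as in the upstream single-node guide:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys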

hdfs namenode -format
start-dfs.sh
start-yarn.sh
jps

Validate service availability via browser endpoints: http://<ip>:8088 (ResourceManager) and http://<ip>:9870 (NameNode Web UI).
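A command-line smoke test complements the web checks (these assume every daemon listed by jps is up):

hdfs dfsadmin -report
hdfs dfs -mkdir -p /user/$(whoami)
hdfs dfs -ls /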

Relational Metadata Backend (MySQL)

Hive requires an external RDBMS to store its table and partition metadata. CentOS 7 ships MariaDB in its base repositories, so register the MySQL community Yum repository first, then install the server (the release RPM version below changes over time; check dev.mysql.com for the current one).

sudo yum install -y https://dev.mysql.com/get/mysql80-community-release-el7-7.noarch.rpm
sudo yum install -y mysql-community-server mysql-community-devel
sudo systemctl enable --now mysqld
sudo systemctl status mysqld

Harden the database instance and prepare credentials for the metastore connector.
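On a fresh MySQL 8 installation the server writes a temporary root password to its log at first start; retrieve it and open a client session before running the statements below:

sudo grep 'temporary password' /var/log/mysqld.log
mysql -u root -p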

ALTER USER 'root'@'localhost' IDENTIFIED BY 'SecureMetastorePass!';
FLUSH PRIVILEGES;

Create the account the metastore connector will use and allow it to connect from any host.

CREATE USER 'metastore_admin'@'%' IDENTIFIED BY 'ClusterAdminPwd!';
GRANT ALL PRIVILEGES ON *.* TO 'metastore_admin'@'%' WITH GRANT OPTION;
FLUSH PRIVILEGES;

Apache Hive Configuration

Binary Extraction and Path Mapping

Unarchive the distribution and expose binaries globally.

sudo tar -xf apache-hive-3.1.3-bin.tar.gz -C /opt/modules/
sudo mv /opt/modules/apache-hive-3.1.3-bin /opt/modules/hive-3.1
echo "export HIVE_HOME=/opt/modules/hive-3.1" | sudo tee -a /etc/profile.d/hadoop_env.sh
echo "export PATH=\$PATH:\$HIVE_HOME/bin" | sudo tee -a /etc/profile.d/hadoop_env.sh
source /etc/profile.d/hadoop_env.sh

Metastore Properties

Generate the active configuration file and direct queries toward MySQL.

cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml

Set the connection parameters and storage directives below, and trim the unused template properties so the file stays readable.

<!-- $HIVE_HOME/conf/hive-site.xml -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://analytics-cluster-node:3306/hive_meta?createDatabaseIfNotExist=true&amp;useSSL=false&amp;serverTimezone=UTC</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>metastore_admin</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>ClusterAdminPwd!</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive/workdir</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
</configuration>

Create the required directories and set access rights for the processing engine. Note that hive.exec.scratchdir above is an HDFS path, so the scratch directory must exist in HDFS, not just on the local disk.

hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse
hdfs dfs -mkdir -p /tmp/hive/workdir
hdfs dfs -chmod 1777 /tmp/hive/workdir

Link native libraries and define auxiliary classpaths.

# $HIVE_HOME/conf/hive-env.sh
export HADOOP_HOME=/opt/modules/hadoop-3.3
export HIVE_CONF_DIR=/opt/modules/hive-3.1/conf
export HIVE_AUX_JARS_PATH=/opt/modules/hive-3.1/lib

Deploy the MySQL JDBC connector (Connector/J) into Hive's lib directory so the metastore can reach the database, as sketched below.
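Assuming Connector/J has already been downloaded to the current directory (the version number is illustrative; use the jar matching your MySQL server):

sudo cp mysql-connector-j-8.0.33.jar $HIVE_HOME/lib/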

Initialization and Query Validation

Bootstrap the metastore schema against the relational backend.

schematool -dbType mysql -initSchema
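If initialization succeeds, the same tool can report the schema version it created:

schematool -dbType mysql -info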

Launch the interactive command-line interface.

hive

Within the session, define a namespace and structural layout for tabular data.

CREATE DATABASE sales_analytics;
USE sales_analytics;
CREATE TABLE transaction_log (
  tx_id INT,
  customer_name STRING,
  amount DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
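A quick structural check inside the same session:

SHOW TABLES;
DESCRIBE transaction_log;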

Prepare a flat dataset containing tab-delimited records. Transfer the file to the Linux workspace using an SFTP client.

# data_dump.tsv
1001	Alice	245.50
1002	Bob	112.00
1003	Carol	890.75

Ingest the payload into the managed table.

LOAD DATA LOCAL INPATH '/workspace/data_dump.tsv' INTO TABLE transaction_log;
SELECT tx_id, customer_name, amount FROM transaction_log WHERE amount > 100.00;
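All three sample amounts exceed 100.00, so the filter should return every row; a count confirms the load:

SELECT COUNT(*) FROM transaction_log;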

Tags: Hadoop Hive MySQL centos big-data

Posted on Wed, 13 May 2026 09:03:51 +0000 by daedalus__