System Prerequisites and Network Configuration
Before deploying the big data stack, establish a stable base environment. The following specifications apply to this deployment:
- OS: CentOS 7 (x86_64)
- Resources: 2 vCPUs, 4GB RAM, 50GB Disk
- Software Versions: OpenJDK 11, Hadoop 3.3.6, Apache Hive 3.1.3, MySQL 8.0 Community Server
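OpenJDK 11 is available from the CentOS 7 repositories; assuming it is not already present, a minimal install sketch (package names as provided by the base/updates repos):
sudo yum install -y java-11-openjdk java-11-openjdk-devel
java -version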
Host Identity and Resolution
Assign a consistent identifier to the machine. This simplifies cluster networking scripts.
sudo hostnamectl set-hostname analytics-cluster-node
sudo systemctl restart systemd-hostnamed
Register the hostname in the local resolver table to prevent DNS lookup delays during daemon startup.
sudo bash -c 'echo "127.0.0.1 analytics-cluster-node" >> /etc/hosts'
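A quick check that the name now resolves locally:
getent hosts analytics-cluster-node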
Security and Time Synchronization
Disable packet filtering so that inter-node RPC traffic and external management tools can reach the service ports. This is acceptable only in an isolated lab environment.
sudo systemctl stop firewalld.service
sudo systemctl disable firewalld.service
Verify clock accuracy across nodes. Discrepancies can cause HDFS replication issues.
timedatectl status
sudo chronyc tracking
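If chronyd is not already active (it ships with CentOS 7 but may be disabled), it can be installed and enabled with:
sudo yum install -y chrony
sudo systemctl enable --now chronyd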
Apache Hadoop Deployment
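The steps below assume the Hadoop 3.3.6 tarball has already been unpacked to /opt/modules/hadoop-3.3; a minimal sketch of that step (archive name as published on hadoop.apache.org):
sudo mkdir -p /opt/modules
sudo tar -xf hadoop-3.3.6.tar.gz -C /opt/modules/
sudo mv /opt/modules/hadoop-3.3.6 /opt/modules/hadoop-3.3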
Environment Variables
Define runtime paths in the global profile to avoid hardcoding in scripts.
sudo nano /etc/profile.d/hadoop_env.sh
Append the following definitions:
export HADOOP_INSTALL=/opt/modules/hadoop-3.3
export PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
Apply changes: source /etc/profile.d/hadoop_env.sh
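The Hadoop start scripts launch daemons in non-interactive shells that do not source /etc/profile.d, so JAVA_HOME should also be declared in hadoop-env.sh; a minimal sketch, followed by a sanity check:
echo "export JAVA_HOME=/usr/lib/jvm/java-11-openjdk" | sudo tee -a $HADOOP_INSTALL/etc/hadoop/hadoop-env.sh
hadoop version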
Core Distributed Filesystem Settings
Configure basic filesystem behavior and RPC endpoints.
<!-- $HADOOP_INSTALL/etc/hadoop/core-site.xml -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://analytics-cluster-node:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/var/data/hadoop/tmp</value>
</property>
</configuration>
Storage Node Topology
Define the NameNode and DataNode storage locations and the replication factor.
<!-- $HADOOP_INSTALL/etc/hadoop/hdfs-site.xml -->
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/var/data/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/var/data/hadoop/hdfs/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
</configuration>
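Neither /var/data/hadoop nor its subdirectories exist by default; assuming the daemons run as the current non-root user, create them and hand over ownership:
sudo mkdir -p /var/data/hadoop/tmp /var/data/hadoop/hdfs/namenode /var/data/hadoop/hdfs/datanode
sudo chown -R "$(whoami)": /var/data/hadoop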
Compute Framework Routing
Redirect MapReduce tasks to the YARN resource manager.
<!-- $HADOOP_INSTALL/etc/hadoop/mapred-site.xml -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>/opt/modules/hadoop-3.3/share/hadoop/mapreduce/*:/opt/modules/hadoop-3.3/share/hadoop/mapreduce/lib/*</value>
</property>
</configuration>
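YARN itself runs with stock settings on a single node, but MapReduce jobs additionally need the shuffle auxiliary service enabled in yarn-site.xml; a minimal sketch, following the same file layout as the configurations above:
<!-- $HADOOP_INSTALL/etc/hadoop/yarn-site.xml -->
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>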
Initialize the distributed namespace and launch background services.
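start-dfs.sh and start-yarn.sh spawn each daemon over ssh, even on a single node, so passwordless ssh to the local hostname must work for the current user first; a minimal sketch:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh analytics-cluster-node exit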
hdfs namenode -format
start-dfs.sh
start-yarn.sh
jps
Validate service availability via browser endpoints: http://<ip>:8088 (ResourceManager) and http://<ip>:9870 (NameNode Web UI).
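A command-line check that the DataNode has registered with the NameNode:
hdfs dfsadmin -report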
Relational Metadata Backend (MySQL)
Hive requires an external RDBMS to store schema definitions and table metadata. The CentOS 7 base repositories do not ship MySQL 8.0, so enable the MySQL Community Yum repository first (install the mysql80-community-release package for EL7 from dev.mysql.com), then install the server packages.
sudo yum install -y mysql-community-server mysql-community-devel
sudo systemctl enable --now mysqld
sudo systemctl status mysqld
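MySQL 8.0 writes a temporary root password to the error log on first start; retrieve it and open the administrative session in which the statements below are executed:
sudo grep 'temporary password' /var/log/mysqld.log
mysql -u root -p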
Harden the database instance and prepare credentials for the metastore connector. Note that MySQL 8.0's default password policy requires mixed case, a digit, and a special character, which the example passwords below satisfy.
ALTER USER 'root'@'localhost' IDENTIFIED BY 'SecureMetastorePass1!';
FLUSH PRIVILEGES;
Create a dedicated metastore account that can connect from any host, since the Hive services reach MySQL over the network.
CREATE USER 'metastore_admin'@'%' IDENTIFIED BY 'ClusterAdminPwd1!';
GRANT ALL PRIVILEGES ON *.* TO 'metastore_admin'@'%' WITH GRANT OPTION;
FLUSH PRIVILEGES;
Apache Hive Configuration
Binary Extraction and Path Mapping
Unarchive the distribution and expose binaries globally.
sudo tar -xf apache-hive-3.1.3-bin.tar.gz -C /opt/modules/
sudo mv /opt/modules/apache-hive-3.1.3-bin /opt/modules/hive-3.1
echo "export HIVE_HOME=/opt/modules/hive-3.1" >> /etc/profile.d/hadoop_env.sh
echo "export PATH=\$PATH:\$HIVE_HOME/bin" >> /etc/profile.d/hadoop_env.sh
source /etc/profile.d/hadoop_env.sh
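A quick sanity check that the binary now resolves on the PATH:
hive --version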
Metastore Properties
Generate the active configuration file and point the metastore connection at MySQL.
cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml
Set the connection parameters and storage directives shown below. The copied template also carries ${system:java.io.tmpdir}/${system:user.name} placeholders and, in Hive 3.1.x, a stray non-printable character entity in one property description; replace the placeholders with concrete paths and delete the bad character, or the CLI will fail to parse the file.
<!-- $HIVE_HOME/conf/hive-site.xml -->
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://analytics-cluster-node:3306/hive_meta?createDatabaseIfNotExist=true&amp;useSSL=false&amp;serverTimezone=UTC</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>metastore_admin</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>ClusterAdminPwd1!</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/hive/workdir</value>
</property>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
</configuration>
Create the required directories and enforce access rights for the processing engine. The warehouse and scratch locations are HDFS paths; a matching local scratch directory is created as well for the CLI.
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse
hdfs dfs -mkdir -p /tmp/hive/workdir
hdfs dfs -chmod 1777 /tmp/hive/workdir
mkdir -p /tmp/hive/workdir
chmod 1777 /tmp/hive/workdir
Create hive-env.sh from its template, then point Hive at the Hadoop installation and define the auxiliary classpath.
cp $HIVE_HOME/conf/hive-env.sh.template $HIVE_HOME/conf/hive-env.sh
# $HIVE_HOME/conf/hive-env.sh
export HADOOP_HOME=/opt/modules/hadoop-3.3
export HIVE_CONF_DIR=/opt/modules/hive-3.1/conf
export HIVE_AUX_JARS_PATH=/opt/modules/hive-3.1/lib
Copy the MySQL JDBC connector (Connector/J) jar into Hive's lib directory so the metastore service can load the driver.
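The exact jar name depends on the Connector/J release downloaded from dev.mysql.com (older versions ship as mysql-connector-java-*.jar, newer ones as mysql-connector-j-*.jar); a sketch assuming the archive has been unpacked in the current directory:
sudo cp mysql-connector-j-*.jar $HIVE_HOME/lib/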
Initialization and Query Validation
Bootstrap the metastore schema against the relational backend.
schematool -dbType mysql -initSchema
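If initialization succeeds, the recorded schema version can be read back:
schematool -dbType mysql -info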
Launch the interactive command-line interface.
hive
Within the session, define a namespace and structural layout for tabular data.
CREATE DATABASE sales_analytics;
USE sales_analytics;
CREATE TABLE transaction_log (
tx_id INT,
customer_name STRING,
amount DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Prepare a flat dataset containing tab-delimited records (fields must be separated by literal tab characters to match the table's '\t' delimiter). Transfer the file to the Linux workspace using an SFTP client.
# data_dump.tsv
1001 Alice 245.50
1002 Bob 112.00
1003 Carol 890.75
Ingest the payload into the managed table.
LOAD DATA LOCAL INPATH '/workspace/data_dump.tsv' INTO TABLE transaction_log;
SELECT tx_id, customer_name, amount FROM transaction_log WHERE amount > 100.00;
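With the three sample rows above, every amount exceeds 100.00, so the filter should return all three records.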