Hadoop Cluster Configuration and Data Pipeline Setup for Offline Data Warehouse

When configuring a Hadoop cluster for an offline data warehouse, proper host mapping and configuration file adjustments are essential.

core-site.xml

In core-site.xml, configure proxy-user settings so the atguigu user can impersonate requests from any host, group, or user (needed by services such as HiveServer2 that submit jobs on behalf of other users):

<property>
  <name>hadoop.proxyuser.atguigu.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.atguigu.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.atguigu.users</name>
  <value>*</value>
</property>

Use consistent hostname mappings (e.g., hadoop102, hadoop103, hadoop104) instead of raw IP addresses to simplify maintenance.
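
The mapping lives in /etc/hosts on every node (and on any client machine). A minimal sketch, assuming the 192.168.10.x addresses used later in this section:

192.168.10.102 hadoop102
192.168.10.103 hadoop103
192.168.10.104 hadoop104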

hdfs-site.xml

Set NameNode and SecondaryNameNode web UI addresses using mapped hostnames:

<property>
  <name>dfs.namenode.http-address</name>
  <value>hadoop102:9870</value>
</property>
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>hadoop104:9868</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>

Ensure the replication factor matches your cluster: a factor of 3 requires at least three DataNodes, so lower it on smaller clusters.
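
To verify replication on a running cluster, or to change it for files already written, a quick sketch (the directory path is a placeholder):

# report files, blocks, and replication status across HDFS
hdfs fsck / -files -blocks
# set replication to 2 for an existing directory and wait for completion
hdfs dfs -setrep -w 2 /path/to/dir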

yarn-site.xml

Configure YARN to run MapReduce jobs with shuffle service and appropriate memory limits:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>hadoop103</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>3072</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>3072</value>
</property>
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.log.server.url</name>
  <value>http://hadoop102:19888/jobhistory/logs</value>
</property>
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>604800</value>
</property>

Adjust memory settings based on each node’s available RAM. The ResourceManager runs on hadoop103, so ensure it has sufficient resources. Note that the physical-memory check stays on while the virtual-memory check is disabled: JVMs routinely reserve far more virtual memory than they actually use, and leaving that check on would kill healthy containers.
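
These values suit small VMs (roughly 4 GB of RAM, an assumption); a quick sanity check before sizing:

# confirm each node's physical RAM before choosing the values above
free -h
# with resource.memory-mb=3072 and minimum-allocation-mb=512, a NodeManager
# can host at most 3072 / 512 = 6 minimum-sized containers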

mapred-site.xml

Direct MapReduce jobs to run on YARN and configure the JobHistory Server:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>hadoop102:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>hadoop102:19888</value>
</property>
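
The JobHistory Server is not launched by the HDFS or YARN start scripts; on Hadoop 3.x it is started separately on hadoop102. A minimal sketch:

# on hadoop102: start (or stop) the JobHistory Server
mapred --daemon start historyserver
mapred --daemon stop historyserver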

workers

List all DataNode and NodeManager hosts using their mapped names. Do not add trailing spaces or blank lines; the start-up scripts parse this file literally:

hadoop102
hadoop103
hadoop104
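
After editing these files, copy the configuration directory to every node so the cluster stays consistent. A sketch assuming passwordless SSH and HADOOP_HOME set on each host:

# push the config directory from hadoop102 to the other nodes
for host in hadoop103 hadoop104; do
  rsync -av "$HADOOP_HOME"/etc/hadoop/ "$host":"$HADOOP_HOME"/etc/hadoop/
done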

Before formatting the NameNode, take a snapshot of your virtual machines. If formatting fails or must be repeated, first stop all Hadoop processes and delete the data and logs directories under the Hadoop installation on every node; otherwise the DataNodes will refuse to join because their stored cluster ID no longer matches the NameNode's.
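
A sketch of the reset-and-format sequence, assuming Hadoop is installed under /opt/module/hadoop-3.1.3:

# on every node: stop daemons, then remove old state
rm -rf /opt/module/hadoop-3.1.3/data /opt/module/hadoop-3.1.3/logs
# on hadoop102 only: format the NameNode
hdfs namenode -format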

Start HDFS on hadoop102 and YARN on hadoop103 (start commands follow the list below). For web UI access (e.g., http://hadoop102:9870), ensure:

  • Firewall is disabled on VMs.
  • Hosts file on Windows (C:\Windows\System32\drivers\etc\hosts) maps hadoop102 to its IP (e.g., 192.168.10.102).
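
The corresponding start commands, run from the node named in each comment:

# on hadoop102: start NameNode, DataNodes, and SecondaryNameNode
start-dfs.sh
# on hadoop103: start ResourceManager and NodeManagers
start-yarn.sh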

Kafka and ZooKeeper

Start ZooKeeper first, then Kafka. When shutting down, stop Kafka before ZooKeeper; if ZooKeeper goes down first, the brokers cannot complete a clean shutdown and end up being force-killed with kill -9.
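
A sketch of the sequence, run on each broker node (the /opt/module install paths are assumptions):

# start: ZooKeeper first, then Kafka
/opt/module/zookeeper/bin/zkServer.sh start
/opt/module/kafka/bin/kafka-server-start.sh -daemon /opt/module/kafka/config/server.properties
# stop: Kafka first, then ZooKeeper
/opt/module/kafka/bin/kafka-server-stop.sh
/opt/module/zookeeper/bin/zkServer.sh stop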

MySQL and Maxwell Setup

Create a dedicated MySQL user for Maxwell with the required privileges (the validate_password variables below use MySQL 5.7 syntax; on MySQL 8.0 they are validate_password.policy and validate_password.length):

CREATE DATABASE maxwell;
SET GLOBAL validate_password_policy=0;
SET GLOBAL validate_password_length=4;
CREATE USER 'maxwell'@'%' IDENTIFIED BY 'maxwell';
GRANT ALL ON maxwell.* TO 'maxwell'@'%';
GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'maxwell'@'%';
FLUSH PRIVILEGES;
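
Maxwell reads the MySQL binlog, which must be enabled in row format. A minimal my.cnf sketch (the server_id value and log file prefix are assumptions; restart MySQL after editing):

[mysqld]
server_id=1
log-bin=mysql-bin
binlog_format=row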

When connecting from a remote client (e.g., DataGrip), disable SSL by setting useSSL=false in the connection string to resolve protocol errors.
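
For example, a connection URL of this shape (host and database are placeholders):

jdbc:mysql://hadoop102:3306/maxwell?useSSL=false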

Maxwell Start/Stop Script

#!/bin/bash
# Start/stop wrapper for Maxwell. MAXWELL_HOME must point at the install directory.
MAXWELL_HOME=/opt/module/maxwell

# Exit status is the number of running Maxwell processes (0 = not running).
# Using the count as a return code is safe because it stays far below 255.
status_maxwell() {
  local count
  count=$(ps -ef | grep com.zendesk.maxwell.Maxwell | grep -v grep | wc -l)
  return "$count"
}

start_maxwell() {
  status_maxwell
  if [[ $? -eq 0 ]]; then
    echo "Starting Maxwell"
    "$MAXWELL_HOME"/bin/maxwell --config "$MAXWELL_HOME"/config.properties --daemon
  else
    echo "Maxwell is already running"
  fi
}

stop_maxwell() {
  status_maxwell
  if [[ $? -gt 0 ]]; then
    echo "Stopping Maxwell"
    # locate and kill the Maxwell JVM processes directly
    pids=$(ps -ef | grep com.zendesk.maxwell.Maxwell | grep -v grep | awk '{print $2}')
    kill -9 $pids
  else
    echo "Maxwell is not running"
  fi
}

case "$1" in
  start)   start_maxwell ;;
  stop)    stop_maxwell ;;
  restart) stop_maxwell; start_maxwell ;;
  *)       echo "Usage: $0 {start|stop|restart}" ;;
esac

Ensure the Kafka topic name in Maxwell’s config matches the consumer’s topic exactly.
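
The relevant lines in $MAXWELL_HOME/config.properties look roughly like this (the topic name topic_db and the broker list are placeholders):

producer=kafka
kafka.bootstrap.servers=hadoop102:9092,hadoop103:9092,hadoop104:9092
kafka_topic=topic_db
# MySQL connection used to read the binlog
host=hadoop102
user=maxwell
password=maxwell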

Flume Configuration

When implementing custom interceptors in Flume:

  • Temporarily comment out the interceptor configuration until the interceptor JAR has been copied into Flume's lib directory.
  • Use the fully qualified class name of your interceptor's Builder class, exactly as it appears in your code (see the sketch after this list).
  • Verify all services (HDFS, Kafka, ZooKeeper, Maxwell, Flume agents) are running during data synchronization tests.
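
A sketch of the interceptor wiring in the agent's .conf file (the agent/source names and the com.example class are placeholders for your own):

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.example.flume.interceptor.ETLInterceptor$Builder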
