When configuring a Hadoop cluster for an offline data warehouse, proper host mapping and configuration file adjustments are essential.
core-site.xml
Proxy-user settings for the atguigu user should allow access from any host, group, or user:
<property>
<name>hadoop.proxyuser.atguigu.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.atguigu.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.atguigu.users</name>
<value>*</value>
</property>
Use consistent hostname mappings (e.g., hadoop102, hadoop103, hadoop104) instead of raw IP addresses to simplify maintenance.
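These mappings live in /etc/hosts on every node. A sketch, using the addresses implied later in this document (the .103 and .104 addresses are assumed to follow the same pattern):

```
192.168.10.102 hadoop102
192.168.10.103 hadoop103
192.168.10.104 hadoop104
```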
hdfs-site.xml
Set NameNode and SecondaryNameNode web UI addresses using mapped hostnames:
<property>
<name>dfs.namenode.http-address</name>
<value>hadoop102:9870</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop104:9868</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
The replication factor must not exceed the number of DataNodes; the default of 3 suits this three-node cluster.
yarn-site.xml
Configure YARN to run MapReduce jobs with shuffle service and appropriate memory limits:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop103</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>512</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>3072</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>3072</value>
</property>
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://hadoop102:19888/jobhistory/logs</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
Adjust memory settings based on each node’s available RAM. The ResourceManager runs on hadoop103, so ensure it has sufficient resources.
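As a quick sanity check on the numbers above (a sketch in plain shell arithmetic; the values mirror yarn-site.xml), a node with 3072 MB of NodeManager memory can host at most six minimum-size containers:

```shell
#!/bin/bash
# Values from yarn-site.xml above
node_mem_mb=3072     # yarn.nodemanager.resource.memory-mb
min_alloc_mb=512     # yarn.scheduler.minimum-allocation-mb
max_alloc_mb=3072    # yarn.scheduler.maximum-allocation-mb

# Upper bound on concurrent minimum-size containers per node
max_containers=$((node_mem_mb / min_alloc_mb))
echo "minimum-size containers per node: $max_containers"

# A single allocation must never exceed the node's total memory
if [ "$max_alloc_mb" -gt "$node_mem_mb" ]; then
  echo "warning: max allocation exceeds node memory"
fi
```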
mapred-site.xml
Direct MapReduce jobs to run on YARN and configure job history server:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hadoop102:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hadoop102:19888</value>
</property>
workers
List all DataNode and NodeManager hosts using their mapped names—no trailing whitespace and no blank lines:
hadoop102
hadoop103
hadoop104
Before formatting the NameNode, take a snapshot of your virtual machines. If formatting fails, delete the data and logs directories in your Hadoop installation before retrying.
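The reset-and-format procedure above, sketched as commands run on hadoop102 (the /opt/module/hadoop-3.1.3 install path is an assumption; substitute your own):

```shell
# Assumed install path; adjust to your environment
cd /opt/module/hadoop-3.1.3
# Clear leftovers from a failed format before retrying
rm -rf data/ logs/
# Format the NameNode (run on hadoop102 only)
bin/hdfs namenode -format
```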
Start HDFS on hadoop102 and YARN on hadoop103. For web UI access (e.g., http://hadoop102:9870), ensure:
- Firewall is disabled on VMs.
- The Windows hosts file (C:\Windows\System32\drivers\etc\hosts) maps hadoop102 to its IP (e.g., 192.168.10.102).
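The startup sequence above, as commands (assuming the Hadoop sbin scripts are on the PATH):

```shell
# On hadoop102: HDFS daemons (NameNode, DataNodes, SecondaryNameNode)
start-dfs.sh
# On hadoop103: YARN daemons (ResourceManager, NodeManagers)
start-yarn.sh
# On hadoop102: job history server, matching mapred-site.xml above
mapred --daemon start historyserver
```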
Kafka and ZooKeeper
Start ZooKeeper first, then Kafka. When shutting down, stop Kafka before ZooKeeper to avoid forced termination via kill -9.
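A sketch of that ordering, per node (the server.properties path assumes a typical /opt/module install):

```shell
# Startup: ZooKeeper first, then Kafka
zkServer.sh start
kafka-server-start.sh -daemon /opt/module/kafka/config/server.properties

# Shutdown: Kafka first, wait for it to exit, then ZooKeeper
kafka-server-stop.sh
zkServer.sh stop
```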
MySQL and Maxwell Setup
Create a dedicated MySQL user for Maxwell with required privileges:
CREATE DATABASE maxwell;
SET GLOBAL validate_password_policy=0;
SET GLOBAL validate_password_length=4;
CREATE USER 'maxwell'@'%' IDENTIFIED BY 'maxwell';
GRANT ALL ON maxwell.* TO 'maxwell'@'%';
GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'maxwell'@'%';
FLUSH PRIVILEGES;
When connecting from a remote client (e.g., DataGrip), disable SSL by setting useSSL=false in the connection string to resolve protocol errors.
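For example, a JDBC URL with SSL disabled (hostname and database are illustrative):

```
jdbc:mysql://hadoop102:3306/maxwell?useSSL=false
```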
Maxwell Start/Stop Script
#!/bin/bash
MAXWELL_HOME=/opt/module/maxwell

# Exit status is the number of running Maxwell processes (0 = not running)
status_maxwell() {
    local count=$(ps -ef | grep com.zendesk.maxwell.Maxwell | grep -v grep | wc -l)
    return $count
}

start_maxwell() {
    status_maxwell
    if [[ $? -eq 0 ]]; then
        echo "Starting Maxwell"
        $MAXWELL_HOME/bin/maxwell --config $MAXWELL_HOME/config.properties --daemon
    else
        echo "Maxwell is already running"
    fi
}

stop_maxwell() {
    status_maxwell
    if [[ $? -gt 0 ]]; then
        echo "Stopping Maxwell"
        pids=$(ps -ef | grep com.zendesk.maxwell.Maxwell | grep -v grep | awk '{print $2}')
        kill -9 $pids
    else
        echo "Maxwell is not running"
    fi
}

case "$1" in
    start)   start_maxwell ;;
    stop)    stop_maxwell ;;
    restart) stop_maxwell; start_maxwell ;;
    *)       echo "Usage: $0 {start|stop|restart}" ;;
esac
Ensure the Kafka topic name in Maxwell’s config matches the consumer’s topic exactly.
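A minimal config.properties sketch for Maxwell (the topic name and password are placeholders; the kafka_topic value must match what the consumer subscribes to):

```properties
producer=kafka
kafka.bootstrap.servers=hadoop102:9092,hadoop103:9092
# Placeholder topic name; must match the consumer's topic exactly
kafka_topic=topic_db
# MySQL connection for the maxwell user created above
host=hadoop102
user=maxwell
password=maxwell
```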
Flume Configuration
When implementing custom interceptors in Flume:
- Temporarily comment out interceptor configurations until the JAR is deployed.
- Use the fully qualified class name of your interceptor as defined in your code.
- Verify all services (HDFS, Kafka, ZooKeeper, Maxwell, Flume agents) are running during data synchronization tests.
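The interceptor wiring might look like this in the agent's .conf file (the agent name a1, source name r1, and the class name are illustrative placeholders; a custom Flume interceptor is referenced by the fully qualified name of its Builder class):

```properties
# a1/r1 and the class below are placeholders for your own names
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.example.flume.interceptor.TimestampInterceptor$Builder
```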