Setting Up Apache Spark 3.0.1 Cluster Modes and Configuring Yarn Log Aggregation

Apache Spark is a near-real-time big data processing engine that requires resource scheduling and task management. While Spark includes its own standalone resource scheduler, it also supports deployment on external cluster managers such as Yarn, Mesos, and Kubernetes.

This guide covers three deployment modes:

  • Local Mode: Ideal for local development and testing in IDEs
  • Standalone Mode: Spark's built-in distributed resource scheduler
  • Yarn Mode: Integration with Hadoop's resource management framework

Environment Setup

Prerequisites: CentOS 7, JDK 8, Hadoop 2.10.1, Spark 3.0.1

Required Software

Ensure the following components are installed:

  • JDK 1.8
  • Scala 2.12.12
  • Hadoop 2.10.1
  • Spark 3.0.1 (hadoop2.7 variant)

Machine Configuration

This guide uses a five-node cluster with the following hosts:

s141 - Master node
s142, s143, s144, s145 - Worker nodes

Passwordless SSH access between nodes is required for cluster communication.
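The SSH setup can be sketched as a small helper run once from the master. This is a minimal sketch, assuming root logins and the default key path (both assumptions; adapt the user and paths to your environment):

```shell
# Hypothetical helper for passwordless SSH setup; run once from s141.
# ssh-copy-id will prompt for each node's password one last time.
setup_passwordless_ssh() {
    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
    for host in s141 s142 s143 s144 s145; do
        ssh-copy-id "$host"
    done
}
```

Afterwards, verify with something like `ssh s142 hostname` - it should return without a password prompt.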

Environment Variables Configuration

Scala Installation

tar -zxvf scala-2.12.12.tgz -C /opt/soft/
cd /opt/soft
ln -s scala-2.12.12 scala

vim /etc/profile
# Add the following lines:
export SCALA_HOME=/opt/soft/scala
export PATH=$PATH:$SCALA_HOME/bin

source /etc/profile

Verify the installation:

scala -version

Spark Installation

tar -zxvf spark-3.0.1-bin-hadoop2.7.tgz -C /opt/soft
cd /opt/soft
ln -s spark-3.0.1-bin-hadoop2.7 spark

vim /etc/profile
# Add the following lines:
export SPARK_HOME=/opt/soft/spark
export PATH=$PATH:$SPARK_HOME/bin

source /etc/profile

Cluster Mode Deployment

Local Mode

This mode runs Spark entirely within a single JVM process, suitable for development and testing.

Execution:

cd /opt/soft/spark/bin
./run-example SparkPi 10

Interactive Shell:

spark-shell

The shell should start successfully, confirming Local mode is operational.

Standalone Mode

This mode utilizes Spark's built-in cluster manager for fully distributed operation.

Configuration Steps

1. Configure spark-env.sh:

cd /opt/soft/spark/conf
cp spark-env.sh.template spark-env.sh
vim spark-env.sh

Add the following parameters:

export SPARK_MASTER_HOST=s141
export SPARK_MASTER_PORT=7077

2. Configure worker nodes:

cd /opt/soft/spark/conf
cp slaves.template slaves
vim slaves

Add worker hostnames:

s142
s143
s144
s145
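The vim edit above can also be scripted. A minimal sketch that writes the worker list non-interactively, using a /tmp demo directory so it is safe to try anywhere (the real path is /opt/soft/spark/conf):

```shell
# Write the slaves file without an editor.
# /tmp/spark-conf-demo is a demo path standing in for /opt/soft/spark/conf.
conf_dir=/tmp/spark-conf-demo
mkdir -p "$conf_dir"
printf '%s\n' s142 s143 s144 s145 > "$conf_dir/slaves"
cat "$conf_dir/slaves"
```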

3. Distribute Spark to all nodes:

xrsync.sh spark-3.0.1-bin-hadoop2.7

4. Create symbolic links on all nodes:

xcall.sh ln -s /opt/soft/spark-3.0.1-bin-hadoop2.7 /opt/soft/spark
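Note that xcall.sh and xrsync.sh are site-specific helper scripts, not part of Spark or Hadoop. Their behavior can be approximated by a small ssh loop; a rough sketch, with the hostnames assumed from this guide's cluster:

```shell
# Sketch of an xcall.sh-style helper: run one command on every node.
# The real script is site-specific; the host list here is an assumption.
xcall() {
    local host
    for host in s141 s142 s143 s144 s145; do
        echo "===== $host ====="
        ssh "$host" "$@"
    done
}
# Example usage: xcall jps
```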

5. Start the cluster:

cd /opt/soft/spark/sbin

# Option 1: Start individually
./start-master.sh
./start-slaves.sh spark://s141:7077

# Option 2: Start all at once
./start-all.sh

Verify process status using jps - master and worker processes should be running.

6. Access the web UI:

http://192.168.30.141:8080

If port 8080 is unavailable, check master logs for the alternative port (typically 8081).

7. Verify with Spark Shell:

cd /opt/soft/spark/bin
./spark-shell --master spark://s141:7077

Yarn Cluster Mode

Running Spark on Yarn leverages Hadoop's resource management capabilities.

Configuration

1. Update spark-env.sh:

Add Hadoop configuration directory to the existing spark-env.sh:

export HADOOP_CONF_DIR=/opt/soft/hadoop/etc/hadoop

Distribute this configuration to all nodes.

2. Start Hadoop Yarn:

start-yarn.sh

3. Start Spark cluster:

cd /opt/soft/spark/sbin
./start-all.sh

4. Verify cluster processes:

Confirm master and worker processes are running on all nodes.

5. Access web interfaces:

Check the Spark master UI at http://192.168.30.141:8080 and the Yarn ResourceManager UI at http://192.168.30.141:8088 (Yarn's default web port).

6. Connect to Yarn:

cd /opt/soft/spark/bin
./spark-shell --master yarn

Yarn Log Aggregation Configuration

Hadoop Configuration Updates

Modify mapred-site.xml to enable job history:

<!-- Spark Yarn Job History Server -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>s141:10020</value>
</property>
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>s141:19888</value>
</property>

Modify yarn-site.xml for memory management and log aggregation:

<!-- Disable physical memory check -->
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>

<!-- Disable virtual memory check -->
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>

<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>

<!-- Log server URL for external applications -->
<property>
    <name>yarn.log.server.url</name>
    <value>http://s141:19888/jobhistory/logs</value>
</property>

<!-- Log retention period in seconds (24 hours) -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>86400</value>
</property>

Distribute these configuration files to all cluster nodes.

Create Hadoop Configuration Symlinks

xcall.sh ln -s /opt/soft/hadoop/etc/hadoop/core-site.xml /opt/soft/spark/conf/core-site.xml
xcall.sh ln -s /opt/soft/hadoop/etc/hadoop/hdfs-site.xml /opt/soft/spark/conf/hdfs-site.xml
xcall.sh ln -s /opt/soft/hadoop/etc/hadoop/yarn-site.xml /opt/soft/spark/conf/yarn-site.xml
xcall.sh ln -s /opt/soft/hadoop/etc/hadoop/mapred-site.xml /opt/soft/spark/conf/mapred-site.xml

Configure Spark History Server

1. Create HDFS directory for event logs:

hdfs dfs -mkdir -p hdfs://mycluster/spark/logs

2. Configure spark-defaults.conf:

cp spark-defaults.conf.template spark-defaults.conf
vim spark-defaults.conf

Add the following:

spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://mycluster/spark/logs
spark.yarn.historyServer.address 192.168.30.141:18080
spark.history.ui.port            18080

3. Configure spark-env.sh for history server:

vim spark-env.sh

Add:

export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://mycluster/spark/logs -Dspark.history.retainedApplications=30"

4. Start Hadoop Job History Server:

mr-jobhistory-daemon.sh start historyserver

5. Start Spark History Server:

sbin/start-history-server.sh

Testing the Deployment

Submit a sample application to verify the configuration:

bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  ./examples/jars/spark-examples_2.12-3.0.1.jar \
  10
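For comparison, the same job can be submitted in cluster deploy mode, where the driver runs inside a Yarn container instead of on the submitting machine. In that case the "Pi is roughly ..." output appears in the aggregated Yarn logs rather than on the console, which exercises the log aggregation configured above:

```shell
bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  ./examples/jars/spark-examples_2.12-3.0.1.jar \
  10
```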

Monitor the application through:

  • Yarn ResourceManager UI: http://192.168.30.141:8088
  • Spark History Server: http://192.168.30.141:18080
  • Hadoop Job History Server: http://192.168.30.141:19888

Cluster Startup Sequence

# Start HDFS and Yarn (Hadoop's start-all.sh)
start-all.sh

# Start Hadoop Job History Server
mr-jobhistory-daemon.sh start historyserver

# Start Spark cluster
sbin/start-all.sh

# Start Spark History Server
sbin/start-history-server.sh

Summary

This guide covered three Spark deployment modes: Local for development, Standalone for independent clusters, and Yarn for Hadoop integration. The log aggregation configuration enables centralized log access through Yarn's history server and Spark's history server, essential for debugging and monitoring production workloads.

Tags: Apache Spark, Spark Cluster, YARN, Big Data, Hadoop

Posted on Wed, 13 May 2026 04:54:52 +0000 by installer69