Apache Spark is a near-real-time big data processing engine that relies on a cluster manager for resource scheduling and task management. While Spark includes its own standalone resource scheduler, it also supports deployment on external platforms such as Yarn, Mesos, and Kubernetes.
This guide covers three deployment modes:
- Local Mode: Ideal for local development and testing in IDEs
- Standalone Mode: Spark's built-in distributed resource scheduler
- Yarn Mode: Integration with Hadoop's resource management framework
Environment Setup
Prerequisites: CentOS 7, JDK 8, Hadoop 2.10.1, Spark 3.0.1
Required Software
Ensure the following components are installed:
- JDK 1.8
- Scala 2.12.12
- Hadoop 2.10.1
- Spark 3.0.1 (hadoop2.7 variant)
Machine Configuration
This guide uses a five-node cluster (one master, four workers) with the following hosts:
s141 - Master node
s142, s143, s144, s145 - Worker nodes
Passwordless SSH access between nodes is required for cluster communication.
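If passwordless SSH is not yet configured, a minimal sketch (run on the master as the deployment user; hostnames follow this guide's naming):
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# Copy the public key to every node, including the master itself
for h in s141 s142 s143 s144 s145; do ssh-copy-id $h; done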
Environment Variables Configuration
Scala Installation
tar -zxvf scala-2.12.12.tgz -C /opt/soft/
cd /opt/soft
ln -s scala-2.12.12 scala
vim /etc/profile
# Add the following lines:
export SCALA_HOME=/opt/soft/scala
export PATH=$PATH:$SCALA_HOME/bin
source /etc/profile
Verify the installation:
scala -version
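If the installation succeeded, the output should look similar to the following (the exact copyright line may differ by build):
Scala code runner version 2.12.12 -- Copyright 2002-2020, LAMP/EPFL and Lightbend, Inc.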
Spark Installation
tar -zxvf spark-3.0.1-bin-hadoop2.7.tgz -C /opt/soft
cd /opt/soft
ln -s spark-3.0.1-bin-hadoop2.7 spark
vim /etc/profile
# Add the following lines:
export SPARK_HOME=/opt/soft/spark
export PATH=$PATH:$SPARK_HOME/bin
source /etc/profile
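Verify the installation; spark-submit prints a version banner and exits:
spark-submit --version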
Cluster Mode Deployment
Local Mode
This mode runs Spark entirely within a single JVM process, suitable for development and testing.
Execution:
cd /opt/soft/spark/bin
./run-example SparkPi 10
Interactive Shell:
spark-shell
The shell should start successfully, confirming Local mode is operational.
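As a quick sanity check, a one-line job at the scala> prompt confirms the local SparkContext works (expected result shown below the command):
scala> sc.parallelize(1 to 1000).sum()
res0: Double = 500500.0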
Standalone Mode
This mode utilizes Spark's built-in cluster manager for fully distributed operation.
Configuration Steps
1. Configure spark-env.sh:
cd /opt/soft/spark/conf
cp spark-env.sh.template spark-env.sh
vim spark-env.sh
Add the following parameters:
export SPARK_MASTER_HOST=s141
export SPARK_MASTER_PORT=7077
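Optionally, worker resources can be capped in the same file; these are standard spark-env.sh settings, shown here with example values:
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=2g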
2. Configure worker nodes:
cd /opt/soft/spark/conf
cp slaves.template slaves
vim slaves
Add worker hostnames:
s142
s143
s144
s145
3. Distribute Spark to all nodes:
xrsync.sh spark-3.0.1-bin-hadoop2.7
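xrsync.sh is a site-specific distribution script; if it is not available, a plain rsync loop over the workers achieves the same (assumes /opt/soft exists on every node):
cd /opt/soft
for h in s142 s143 s144 s145; do
  rsync -az spark-3.0.1-bin-hadoop2.7 $h:/opt/soft/
done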
4. Create symbolic links on all nodes:
xcall.sh ln -s /opt/soft/spark-3.0.1-bin-hadoop2.7 /opt/soft/spark
5. Start the cluster:
cd /opt/soft/spark/sbin
# Option 1: Start individually
./start-master.sh
./start-slaves.sh   # workers read the master URL from SPARK_MASTER_HOST/PORT in spark-env.sh
# Option 2: Start all at once
./start-all.sh
Verify process status using jps - master and worker processes should be running.
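Sample jps output (PIDs will differ):
# On s141
12345 Master
# On s142 through s145
23456 Worker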
6. Access the web UI:
http://192.168.30.141:8080
If port 8080 is unavailable, the master binds to the next free port; check the master logs for the actual port (typically 8081).
7. Verify with Spark Shell:
cd /opt/soft/spark/bin
./spark-shell --master spark://s141:7077
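A batch job can be submitted against the standalone master the same way; this mirrors the Yarn submission shown later in this guide:
./spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://s141:7077 \
/opt/soft/spark/examples/jars/spark-examples_2.12-3.0.1.jar \
100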
Yarn Cluster Mode
Running Spark on Yarn leverages Hadoop's resource management capabilities.
Configuration
1. Update spark-env.sh:
Add Hadoop configuration directory to the existing spark-env.sh:
export HADOOP_CONF_DIR=/opt/soft/hadoop/etc/hadoop
Distribute this configuration to all nodes.
2. Start Hadoop Yarn:
start-yarn.sh
3. Start Spark cluster:
cd /opt/soft/spark/sbin
./start-all.sh
4. Verify cluster processes:
Confirm master and worker processes are running on all nodes.
5. Access web interfaces:
- Spark Master UI: http://192.168.30.141:8080
- Yarn ResourceManager: http://192.168.30.141:8088
6. Connect to Yarn:
cd /opt/soft/spark/bin
./spark-shell --master yarn
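While the shell is running, the corresponding application should appear in Yarn's application list (also visible in the ResourceManager UI):
yarn application -list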
Yarn Log Aggregation Configuration
Hadoop Configuration Updates
Modify mapred-site.xml to enable job history:
<!-- Spark Yarn - Job History Server -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>s141:10020</value>
</property>
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>s141:19888</value>
</property>
Modify yarn-site.xml for memory management and log aggregation:
<!-- Disable physical memory check -->
<property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
</property>
<!-- Disable virtual memory check -->
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- Log server URL for external applications -->
<property>
    <name>yarn.log.server.url</name>
    <value>http://s141:19888/jobhistory/logs</value>
</property>
<!-- Log retention period in seconds (24 hours) -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>86400</value>
</property>
Distribute these configuration files to all cluster nodes.
Create Hadoop Configuration Symlinks
xcall.sh ln -s /opt/soft/hadoop/etc/hadoop/core-site.xml /opt/soft/spark/conf/core-site.xml
xcall.sh ln -s /opt/soft/hadoop/etc/hadoop/hdfs-site.xml /opt/soft/spark/conf/hdfs-site.xml
xcall.sh ln -s /opt/soft/hadoop/etc/hadoop/yarn-site.xml /opt/soft/spark/conf/yarn-site.xml
xcall.sh ln -s /opt/soft/hadoop/etc/hadoop/mapred-site.xml /opt/soft/spark/conf/mapred-site.xml
Configure Spark History Server
1. Create an HDFS directory for event logs (mycluster is the HDFS nameservice defined in hdfs-site.xml; substitute your cluster's nameservice if it differs):
hdfs dfs -mkdir -p hdfs://mycluster/spark/logs
2. Configure spark-defaults.conf:
cp spark-defaults.conf.template spark-defaults.conf
vim spark-defaults.conf
Add the following:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://mycluster/spark/logs
spark.yarn.historyServer.address 192.168.30.141:18080
spark.history.ui.port 18080
3. Configure spark-env.sh for history server:
vim spark-env.sh
Add:
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.fs.logDirectory=hdfs://mycluster/spark/logs -Dspark.history.retainedApplications=30"
4. Start Hadoop Job History Server:
mr-jobhistory-daemon.sh start historyserver
5. Start Spark History Server:
sbin/start-history-server.sh
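After both steps, jps on s141 should additionally list the two history server processes (PIDs will differ):
34567 JobHistoryServer
45678 HistoryServer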
Testing the Deployment
Submit a sample application to verify the configuration:
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
./examples/jars/spark-examples_2.12-3.0.1.jar \
10
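On success, the driver output should include a line similar to the following (the digits vary between runs):
Pi is roughly 3.1415431415431416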
Monitor the application through:
- Yarn ResourceManager UI: http://192.168.30.141:8088
- Spark History Server UI: http://192.168.30.141:18080
Cluster Startup Sequence
# Start Hadoop (HDFS and Yarn)
start-all.sh
# Start Hadoop Job History Server
mr-jobhistory-daemon.sh start historyserver
# Start Spark cluster
sbin/start-all.sh
# Start Spark History Server
sbin/start-history-server.sh
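To shut everything down, run the standard stop counterparts in roughly the reverse order:
# Stop Spark History Server and Spark cluster
sbin/stop-history-server.sh
sbin/stop-all.sh
# Stop Hadoop Job History Server
mr-jobhistory-daemon.sh stop historyserver
# Stop HDFS and Yarn
stop-all.sh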
Summary
This guide covered three Spark deployment modes: Local for development, Standalone for independent clusters, and Yarn for Hadoop integration. The log aggregation configuration enables centralized log access through Yarn's history server and Spark's history server, essential for debugging and monitoring production workloads.