To build a Spark distribution compatible with Hadoop CDH 5.7.0 and Hive support, navigate to the Spark source directory:
[hadoop@hadoop001 spark-2.1.0]$ pwd
/home/hadoop/source/spark-2.1.0
Compile using Maven with profiles for YARN, Hadoop 2.6, Hive, and Hive Thriftserver:
./build/mvn -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver \
-Dhadoop.version=2.6.0-cdh5.7.0 -DskipTests clean package
Alternatively, generate a distributable tarball:
./dev/make-distribution.sh --name 2.6.0-cdh5.7.0 --tgz \
-Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver \
-Dhadoop.version=2.6.0-cdh5.7.0
This produces spark-2.1.0-bin-2.6.0-cdh5.7.0.tgz.
If dependency resolution fails due to missing Cloudera artifacts, add the Cloudera repository to pom.xml:
<repositories>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
</repositories>
To build against Scala 2.10 instead of the default 2.11, switch the build before compiling:
./dev/change-scala-version.sh 2.10
Local Mode Setup
Extract the built archive:
tar -zxvf spark-2.1.0-bin-2.6.0-cdh5.7.0.tgz -C ~/app/
Set SPARK_HOME in your shell profile (e.g. ~/.bash_profile), add $SPARK_HOME/bin to PATH, and source the file. When launching spark-shell, include the MySQL JDBC driver if the Hive metastore uses MySQL:
spark-shell --master local[2] \
--jars /home/hadoop/software/mysql-connector-java-5.1.27-bin.jar
Without the driver, Spark SQL cannot reach the MySQL-backed metastore and fails with DataNucleus's DatastoreDriverNotFoundException.
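Once the shell is up, a quick check that the metastore connection works is to list databases through it (a minimal sketch; assumes at least the default database exists):
// run inside spark-shell: resolves databases via the Hive metastore
spark.sql("show databases").show()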
Standalone Cluster Mode
Configure conf/spark-env.sh:
SPARK_MASTER_HOST=hadoop001
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_INSTANCES=1
List worker nodes in conf/slaves:
hadoop2
hadoop3
...
hadoop10
Start the cluster with sbin/start-all.sh. This launches a Master on hadoop001 and a Worker on each host listed in conf/slaves.
Example WordCount in Scala (run inside spark-shell, where spark is the pre-built SparkSession):
val file = spark.sparkContext.textFile("file:///home/hadoop/data/wc.txt")
val wordCounts = file
  .flatMap(_.split("\\s+"))  // split each line on whitespace
  .map(word => (word, 1))    // pair each word with a count of 1
  .reduceByKey(_ + _)        // sum the counts per word
wordCounts.collect()
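As a follow-up sketch, the counts can be sorted before inspection (take(10) is an arbitrary choice here):
// order by count, descending, and print the top 10 words
wordCounts.sortBy(_._2, ascending = false).take(10).foreach(println)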
IDE Development Configuration
When running Spark applications locally in an IDE like IntelliJ, specify the master URL via VM options:
-Dspark.master=local
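If you prefer to avoid VM options, a sketch of falling back to a local master only when none was supplied, so spark-submit still controls the master in cluster runs:
val conf = new SparkConf().setAppName("SQLContextApp")
// fall back to local[2] only when spark.master is absent (e.g. launched from the IDE)
if (!conf.contains("spark.master")) conf.setMaster("local[2]")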
Example using SQLContext (deprecated since Spark 2.0, but still useful for legacy code). Note the input path is read from args so it matches the spark-submit invocation below:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object SQLContextApp {
  def main(args: Array[String]): Unit = {
    val path = args(0) // input file passed as the first application argument
    val conf = new SparkConf().setAppName("SQLContextApp")
    val sc = new SparkContext(conf)
    val sqlCtx = new SQLContext(sc)
    val people = sqlCtx.read.json(path)
    people.printSchema()
    people.show()
    sc.stop()
  }
}
Package with Maven (ensure MAVEN_HOME and PATH are set), then submit, passing the input file as the application argument:
spark-submit \
--class org.example.SQLContextApp \
--master local[2] \
/home/hadoop/lib/sql-app.jar \
people.json
Hive Integration
Use HiveContext (deprecated in favor of SparkSession) or use SparkSession with Hive support directly. Either way, copy Hive's hive-site.xml into $SPARK_HOME/conf so Spark can locate the metastore.
Example with HiveContext:
import org.apache.spark.sql.hive.HiveContext
val hiveCtx = new HiveContext(sc)
hiveCtx.sql("SELECT * FROM emp").show()
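The spark-submit command below references an org.example.HiveContextApp class; a minimal sketch of such an application (assuming the emp table from above exists in the default database) could look like:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object HiveContextApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HiveContextApp")
    val sc = new SparkContext(conf)
    val hiveCtx = new HiveContext(sc) // deprecated since 2.0, still works in 2.1
    hiveCtx.sql("SELECT * FROM emp").show()
    sc.stop()
  }
}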
Submit with MySQL driver:
spark-submit \
--class org.example.HiveContextApp \
--master local[2] \
--jars /home/hadoop/software/mysql-connector-java-5.1.27-bin.jar \
/home/hadoop/lib/hive-app.jar
Modern approach using SparkSession:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
.appName("SparkSessionApp")
.master("local[2]")
.enableHiveSupport()
.getOrCreate()
spark.sql("SHOW TABLES").show()
spark.stop()
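The same session can hand Hive tables back as DataFrames; a sketch to run before spark.stop() (the emp table's deptno column is an assumption here):
// load a Hive table as a DataFrame, cache it, and aggregate
val emp = spark.table("emp")
emp.cache()
emp.groupBy("deptno").count().show()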
Performance Comparison
In this setup, queries executed via Spark SQL typically return in under a second, whereas the same queries in the Hive CLI can take tens of seconds, because each Hive query is compiled into MapReduce jobs while Spark reuses in-memory executors.
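To put numbers behind that comparison on your own data, SparkSession.time (available since Spark 2.1) prints the wall-clock time of a block; a sketch against the emp table:
// run inside spark-shell: prints "Time taken: ... ms" along with the result
spark.time(spark.sql("SELECT count(*) FROM emp").show())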
To suppress metastore schema version warnings in Hive, add to hive-site.xml:
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>