Configuring and Running Spark SQL with Hive Integration

To build a Spark distribution compatible with Hadoop CDH 5.7.0 and Hive support, navigate to the Spark source directory:

[hadoop@hadoop001 spark-2.1.0]$ pwd
/home/hadoop/source/spark-2.1.0

Compile using Maven with profiles for YARN, Hadoop 2.6, Hive, and Hive Thriftserver:

./build/mvn -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver \
  -Dhadoop.version=2.6.0-cdh5.7.0 -DskipTests clean package

Alternatively, generate a distributable tarball:

./dev/make-distribution.sh --name 2.6.0-cdh5.7.0 --tgz \
  -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver \
  -Dhadoop.version=2.6.0-cdh5.7.0

This produces spark-2.1.0-bin-2.6.0-cdh5.7.0.tgz.

If dependency resolution fails due to missing Cloudera artifacts, add the Cloudera repository to pom.xml:

<repositories>
  <repository>
    <id>cloudera</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>

To build against Scala 2.10 instead of the default 2.11, switch the Scala version before compiling:

./dev/change-scala-version.sh 2.10

Local Mode Setup

Extract the built archive:

tar -zxvf spark-2.1.0-bin-2.6.0-cdh5.7.0.tgz -C ~/app/

Set SPARK_HOME to the extracted directory and source your shell profile. When launching spark-shell, include the MySQL JDBC driver on --jars if the Hive metastore uses MySQL:

spark-shell --master local[2] \
  --jars /home/hadoop/software/mysql-connector-java-5.1.27-bin.jar

Without the driver, the metastore connection fails with a DatastoreDriverNotFoundException reporting that com.mysql.jdbc.Driver was not found on the CLASSPATH.
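
Once the shell starts, a quick query confirms that the Hive metastore is reachable (this assumes hive-site.xml is already in $SPARK_HOME/conf, as described under Hive Integration below):

spark.sql("show tables").show()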

Standalone Cluster Mode

Configure conf/spark-env.sh:

SPARK_MASTER_HOST=hadoop001
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_INSTANCES=1

List worker nodes in conf/slaves:

hadoop2
hadoop3
...
hadoop10

Start the cluster with sbin/start-all.sh. This launches a master on hadoop001 and workers on all listed hosts.
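
Applications and the shell can then target the cluster through its master URL; the master web UI on hadoop001:8080 lists the registered workers. A minimal sketch, assuming the default master port 7077:

import org.apache.spark.sql.SparkSession

// Connect to the standalone master started above
val spark = SparkSession.builder()
  .appName("StandaloneApp")
  .master("spark://hadoop001:7077")
  .getOrCreate()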

Example WordCount in Scala:

// Read the local file as an RDD of lines
val file = spark.sparkContext.textFile("file:///home/hadoop/data/wc.txt")
val wordCounts = file
  .flatMap(_.split("\\s+"))  // split each line on whitespace
  .map(word => (word, 1))    // pair each word with a count of 1
  .reduceByKey(_ + _)        // sum the counts per word
wordCounts.collect()
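
A common follow-up is to order the result by frequency, for example:

// ten most frequent (word, count) pairs
wordCounts.sortBy(_._2, ascending = false).take(10)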

IDE Development Configuration

When running Spark applications locally in an IDE like IntelliJ, specify the master URL via VM options:

-Dspark.master=local
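
Alternatively, the master can be hard-coded while developing; a minimal sketch (remove it before packaging so the master can be chosen at submit time):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("SQLContextApp")
  .setMaster("local[2]") // development only; pass --master to spark-submit in production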

Example using SQLContext:

package org.example

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object SQLContextApp {
  def main(args: Array[String]): Unit = {
    // The input path is passed as the first program argument (see spark-submit below)
    val path = args(0)

    val conf = new SparkConf().setAppName("SQLContextApp")
    val sc = new SparkContext(conf)
    // SQLContext is the pre-2.0 entry point; SparkSession (shown later) supersedes it
    val sqlCtx = new SQLContext(sc)

    val people = sqlCtx.read.json(path)
    people.printSchema()
    people.show()

    sc.stop()
  }
}

Package the application with Maven (ensure MAVEN_HOME and PATH are set, then run mvn clean package) and submit, passing the JSON file path as the program argument:

spark-submit \
  --class org.example.SQLContextApp \
  --master local[2] \
  /home/hadoop/lib/sql-app.jar \
  people.json
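
The people.json used here can be the sample file that ships with Spark under examples/src/main/resources/people.json, whose records look like:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}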

Hive Integration

Use HiveContext (deprecated in favor of SparkSession) or use SparkSession with Hive support directly. In either case, copy hive-site.xml into $SPARK_HOME/conf so Spark can locate the Hive metastore.

Example with HiveContext, run in spark-shell where sc is already defined (the packaged equivalent, HiveContextApp, is submitted below):

import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
hiveCtx.sql("SELECT * FROM emp").show()

Submit with MySQL driver:

spark-submit \
  --class org.example.HiveContextApp \
  --master local[2] \
  --jars /home/hadoop/software/mysql-connector-java-5.1.27-bin.jar \
  /home/hadoop/lib/hive-app.jar

Modern approach using SparkSession:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSessionApp")
  .master("local[2]")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW TABLES").show()
spark.stop()
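
Hive tables are also reachable through the DataFrame API from the same session; a small sketch, assuming the emp table from the HiveContext example exists in the metastore:

val emp = spark.table("emp")  // load the Hive table as a DataFrame
emp.printSchema()
emp.show(5)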

Performance Comparison

Queries executed via Spark SQL typically complete in under a second, whereas equivalent Hive CLI queries may take tens of seconds due to MapReduce overhead.

To suppress metastore schema version warnings in Hive, add to hive-site.xml:

<property>
  <name>hive.metastore.schema.verification</name>
  <value>false</value>
</property>
