Understanding Grouping Sets in SQL

Introduction

The GROUP BY clause in SQL is widely recognized for organizing data according to specific criteria, often combined with aggregate functions.

Consider a table named dealer, whose columns are id (Int), city (String), car_model (String), and quantity (Int), with the following entries:

 +---+--------+------------+--------+
 | id|    city|   car_model|quantity|
 +---+--------+------------+--------+
 |100| Fremont| Honda Civic|      10|
 |100| Fremont|Honda Accord|      15|
 |100| Fremont|   Honda CRV|       7|
 |200|  Dublin| Honda Civic|      20|
 |200|  Dublin|Honda Accord|      10|
 |200|  Dublin|   Honda CRV|       3|
 |300|San Jose| Honda Civic|       5|
 |300|San Jose|Honda Accord|       8|
 +---+--------+------------+--------+

Executing the SQL query SELECT id, sum(quantity) FROM dealer GROUP BY id ORDER BY id yields:

 +---+-------------+
 | id|sum(quantity)|
 +---+-------------+
 |100|           32|
 |200|           33|
 |300|           13|
 +---+-------------+

This command groups records by the id column and sums the quantity within each group.

Beyond basic usage, GROUP BY supports advanced features such as Grouping Sets, RollUp, and Cube, commonly used in OLAP contexts. Both RollUp and Cube are implemented in terms of Grouping Sets, so understanding Grouping Sets is essential to understanding the other two.

This article introduces Grouping Sets and delves into its implementation within Spark SQL.

Overview of Grouping Sets

The Spark SQL documentation describes GROUPING SETS as follows:

Groups the rows for each grouping set specified after GROUPING SETS. This clause is a shorthand for a UNION ALL where each leg of the UNION ALL operator performs aggregation of each grouping set specified in the GROUPING SETS clause.

Essentially, Grouping Sets lets you specify multiple grouping rules in one query; each rule is aggregated independently, and the results are combined with UNION ALL. For instance, applying GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ()) to the dealer table gives the same output as performing four separate grouped queries and combining them via UNION ALL.

Here's the result for the Grouping Sets version:

 spark-sql> SELECT city, car_model, sum(quantity) AS sum FROM dealer 
          > GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ()) 
          > ORDER BY city, car_model;
 +--------+------------+---+
 |    city|   car_model|sum|
 +--------+------------+---+
 |    null|        null| 78|
 |    null|Honda Accord| 33|
 |    null|   Honda CRV| 10|
 |    null| Honda Civic| 35|
 |  Dublin|        null| 33|
 |  Dublin|Honda Accord| 10|
 |  Dublin|   Honda CRV|  3|
 |  Dublin| Honda Civic| 20|
 | Fremont|        null| 32|
 | Fremont|Honda Accord| 15|
 | Fremont|   Honda CRV|  7|
 | Fremont| Honda Civic| 10|
 |San Jose|        null| 13|
 |San Jose|Honda Accord|  8|
 |San Jose| Honda Civic|  5|
 +--------+------------+---+

And here’s the equivalent UNION ALL approach:

 spark-sql> (SELECT city, car_model, sum(quantity) AS sum FROM dealer GROUP BY city, car_model) UNION ALL 
          > (SELECT city, NULL as car_model, sum(quantity) AS sum FROM dealer GROUP BY city) UNION ALL 
          > (SELECT NULL as city, car_model, sum(quantity) AS sum FROM dealer GROUP BY car_model) UNION ALL 
          > (SELECT NULL as city, NULL as car_model, sum(quantity) AS sum FROM dealer) 
          > ORDER BY city, car_model;
 +--------+------------+---+
 |    city|   car_model|sum|
 +--------+------------+---+
 |    null|        null| 78|
 |    null|Honda Accord| 33|
 |    null|   Honda CRV| 10|
 |    null| Honda Civic| 35|
 |  Dublin|        null| 33|
 |  Dublin|Honda Accord| 10|
 |  Dublin|   Honda CRV|  3|
 |  Dublin| Honda Civic| 20|
 | Fremont|        null| 32|
 | Fremont|Honda Accord| 15|
 | Fremont|   Honda CRV|  7|
 | Fremont| Honda Civic| 10|
 |San Jose|        null| 13|
 |San Jose|Honda Accord|  8|
 |San Jose| Honda Civic|  5|
 +--------+------------+---+

Both approaches yield identical outputs.
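
Note that in both result sets, null does double duty: it marks columns rolled up by a grouping set, but it would equally appear for genuinely null data. Spark SQL provides the built-in grouping() and grouping_id() functions to tell the two apart. A minimal sketch, assuming a SparkSession named spark and the dealer table above (the alias names are arbitrary):

 // grouping(col) returns 1 when `col` was rolled up (aggregated away) by the
 // current grouping set, and 0 when it actually participated in the grouping,
 // so subtotal rows can be distinguished from rows with genuinely null data.
 spark.sql("""
   SELECT city, car_model, sum(quantity) AS sum,
          grouping(city)      AS city_rolled_up,
          grouping(car_model) AS car_model_rolled_up
   FROM dealer
   GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
   ORDER BY city, car_model
 """).show()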

Execution Plans

While both versions are functionally equivalent, the Grouping Sets version offers a more concise syntax. To examine whether Grouping Sets is merely syntactic sugar, let’s compare execution plans.

Using explain extended, the optimized logical plan for the UNION ALL version appears as follows:

 spark-sql> explain extended (SELECT city, car_model, sum(quantity) AS sum FROM dealer GROUP BY city, car_model) UNION ALL 
          > (SELECT city, NULL as car_model, sum(quantity) AS sum FROM dealer GROUP BY city) UNION ALL 
          > (SELECT NULL as city, car_model, sum(quantity) AS sum FROM dealer GROUP BY car_model) UNION ALL 
          > (SELECT NULL as city, NULL as car_model, sum(quantity) AS sum FROM dealer) 
          > ORDER BY city, car_model;
 == Parsed Logical Plan ==
 ...
 == Analyzed Logical Plan ==
 ...
 == Optimized Logical Plan ==
 Sort [city#93 ASC NULLS FIRST, car_model#94 ASC NULLS FIRST], true
 +- Union false, false
    :- Aggregate [city#93, car_model#94], [city#93, car_model#94, sum(quantity#95) AS sum#79L]
    :  +- Project [city#93, car_model#94, quantity#95]
    :     +- HiveTableRelation [`default`.`dealer`, ..., Data Cols: [id#92, city#93, car_model#94, quantity#95], Partition Cols: []]
    :- Aggregate [city#97], [city#97, null AS car_model#112, sum(quantity#99) AS sum#81L]
    :  +- Project [city#97, quantity#99]
    :     +- HiveTableRelation [`default`.`dealer`, ..., Data Cols: [id#96, city#97, car_model#98, quantity#99], Partition Cols: []]
    :- Aggregate [car_model#102], [null AS city#113, car_model#102, sum(quantity#103) AS sum#83L]
    :  +- Project [car_model#102, quantity#103]
    :     +- HiveTableRelation [`default`.`dealer`, ..., Data Cols: [id#100, city#101, car_model#102, quantity#103], Partition Cols: []]
    +- Aggregate [null AS city#114, null AS car_model#115, sum(quantity#107) AS sum#86L]
       +- Project [quantity#107]
          +- HiveTableRelation [`default`.`dealer`, ..., Data Cols: [id#104, city#105, car_model#106, quantity#107], Partition Cols: []]
 == Physical Plan ==
 ...

The optimized plan shows:

  1. Each of the four subqueries scans the dealer table and computes its aggregation independently.
  2. Results are combined using the Union node.
  3. Final sorting occurs at the Sort node.

For the Grouping Sets version:

 spark-sql> explain extended SELECT city, car_model, sum(quantity) AS sum FROM dealer GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ()) ORDER BY city, car_model;
 == Parsed Logical Plan ==
 ...
 == Analyzed Logical Plan ==
 ...
 == Optimized Logical Plan ==
 Sort [city#138 ASC NULLS FIRST, car_model#139 ASC NULLS FIRST], true
 +- Aggregate [city#138, car_model#139, spark_grouping_id#137L], [city#138, car_model#139, sum(quantity#133) AS sum#124L]
    +- Expand [[quantity#133, city#131, car_model#132, 0], [quantity#133, city#131, null, 1], [quantity#133, null, car_model#132, 2], [quantity#133, null, null, 3]], [quantity#133, city#138, car_model#139, spark_grouping_id#137L]
       +- Project [quantity#133, city#131, car_model#132]
          +- HiveTableRelation [`default`.`dealer`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#130, city#131, car_model#132, quantity#133], Partition Cols: []]
 == Physical Plan ==
 ...

This plan is simpler. Key steps include:

  1. Full scan of the dealer table.
  2. Projection of relevant fields.
  3. The Expand node emits one copy of each row per grouping set, nulling out the columns that set does not group by and adding a spark_grouping_id column that identifies the set (see the sketch after this list).
  4. Aggregation using city, car_model, and spark_grouping_id.
  5. Sorting at the end.
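
To make step 3 concrete, here is a small sketch in plain Scala (no Spark required; DealerRow and expand are illustrative names, not Spark APIs) of what Expand emits for a single input row. It mirrors the four projections and grouping-id values 0 through 3 in the plan above:

 // Each input row fans out into one row per grouping set: columns absent
 // from a grouping set are nulled out (None here) and a grouping id is
 // appended, matching the Expand projections in the optimized plan.
 case class DealerRow(city: String, carModel: String, quantity: Int)

 def expand(r: DealerRow): Seq[(Option[String], Option[String], Int, Int)] = Seq(
   (Some(r.city), Some(r.carModel), r.quantity, 0), // grouping set (city, car_model)
   (Some(r.city), None,             r.quantity, 1), // grouping set (city)
   (None,         Some(r.carModel), r.quantity, 2), // grouping set (car_model)
   (None,         None,             r.quantity, 3)  // grouping set ()
 )

 expand(DealerRow("Fremont", "Honda Civic", 10)).foreach(println)
 // (Some(Fremont),Some(Honda Civic),10,0)
 // (Some(Fremont),None,10,1)
 // (None,Some(Honda Civic),10,2)
 // (None,None,10,3)

Aggregating the expanded rows by (city, car_model, spark_grouping_id) then yields exactly one result row per grouping set, which is what the Aggregate node in the plan does.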

Implementation of the Expand Operator

In Spark SQL, the Expand operator is implemented through the ExpandExec class. Its core logic is outlined below:

 case class ExpandExec(
     projections: Seq[Seq[Expression]],
     output: Seq[Attribute],
     child: SparkPlan)
   extends UnaryExecNode with CodegenSupport {

   // Compiles one grouping set's expressions into an UnsafeProjection
   // over the child operator's output schema.
   private[this] val projection =
     (exprs: Seq[Expression]) => UnsafeProjection.create(exprs, child.output)

   protected override def doExecute(): RDD[InternalRow] = {
     val numOutputRows = longMetric("numOutputRows")

     child.execute().mapPartitions { iter =>
       // One compiled projection per grouping set.
       val groups = projections.map(projection).toArray
       new Iterator[InternalRow] {
         private[this] var result: InternalRow = _
         private[this] var idx = -1  // -1 marks the initial state
         private[this] var input: InternalRow = _

         override final def hasNext: Boolean =
           (-1 < idx && idx < groups.length) || iter.hasNext

         override final def next(): InternalRow = {
           // In the initial state, or once every projection has been applied
           // to the current row (idx reset to 0 below), fetch the next input row.
           if (idx <= 0) {
             input = iter.next()
             idx = 0
           }

           // Emit the current grouping set's view of the input row.
           result = groups(idx)(input)
           idx += 1

           // Finished all grouping sets for this row; restart the cycle
           // for the next input row, if any.
           if (idx == groups.length && iter.hasNext) {
             idx = 0
           }

           numOutputRows += 1
           result
         }
       }
     }
   }
 }

The key aspects are:

  1. projections holds the expression sequences for each grouping set.
  2. For each input record, Expand creates multiple output records.
  3. Each UnsafeProjection corresponds to one grouping set.

This mechanism reproduces the effect of UNION ALL by expanding each input row into one row per grouping set before aggregation, while scanning the underlying table only once.
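
The same pattern can be distilled into a few lines of standalone Scala. This is a simplified sketch of the idea rather than Spark's actual code: one input element is held at a time while idx cycles through the projections, so the expanded rows are produced lazily instead of being buffered:

 // A generic, lazy version of ExpandExec's iterator: applies each of the
 // `projections` to every input element, yielding projections.length
 // outputs per input without materializing them all at once.
 def expandIterator[A, B](input: Iterator[A], projections: Array[A => B]): Iterator[B] =
   new Iterator[B] {
     private var idx = -1          // -1 marks the initial state
     private var current: A = _

     def hasNext: Boolean = (idx > -1 && idx < projections.length) || input.hasNext

     def next(): B = {
       if (idx <= 0) {             // initial state, or a fresh input element
         current = input.next()
         idx = 0
       }
       val out = projections(idx)(current)
       idx += 1
       if (idx == projections.length && input.hasNext) idx = 0
       out
     }
   }

 // Two inputs x three projections = six lazily produced outputs:
 // 1, 10, 100, 2, 20, 200
 expandIterator(Iterator(1, 2), Array[Int => Int](_ * 1, _ * 10, _ * 100)).foreach(println)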

Performance Comparison

Benchmarking (ten runs of each version against the dealer table) shows a consistent performance difference between Grouping Sets and UNION ALL:

 // Grouping Sets version
 Time taken: 0.289 seconds, Fetched 15 row(s)
 Time taken: 0.251 seconds, Fetched 15 row(s)
 Time taken: 0.259 seconds, Fetched 15 row(s)
 Time taken: 0.258 seconds, Fetched 15 row(s)
 Time taken: 0.296 seconds, Fetched 15 row(s)
 Time taken: 0.247 seconds, Fetched 15 row(s)
 Time taken: 0.298 seconds, Fetched 15 row(s)
 Time taken: 0.286 seconds, Fetched 15 row(s)
 Time taken: 0.292 seconds, Fetched 15 row(s)
 Time taken: 0.282 seconds, Fetched 15 row(s)

 // UNION ALL version
 Time taken: 0.628 seconds, Fetched 15 row(s)
 Time taken: 0.594 seconds, Fetched 15 row(s)
 Time taken: 0.591 seconds, Fetched 15 row(s)
 Time taken: 0.607 seconds, Fetched 15 row(s)
 Time taken: 0.616 seconds, Fetched 15 row(s)
 Time taken: 0.64 seconds, Fetched 15 row(s)
 Time taken: 0.623 seconds, Fetched 15 row(s)
 Time taken: 0.625 seconds, Fetched 15 row(s)
 Time taken: 0.62 seconds, Fetched 15 row(s)
 Time taken: 0.62 seconds, Fetched 15 row(s)

Average times are 0.276 seconds for Grouping Sets vs 0.616 seconds for UNION ALL, indicating a ~2.2x performance improvement.
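
For reference, a rough way to reproduce this kind of measurement from spark-shell, assuming the dealer table exists in the current catalog (absolute numbers will of course vary with hardware, data volume, and configuration):

 // spark.time prints the wall-clock time of evaluating the given expression;
 // collect() forces the query to run to completion. Swap in the UNION ALL
 // text to time the other variant.
 val groupingSetsQuery = """
   SELECT city, car_model, sum(quantity) AS sum FROM dealer
   GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
   ORDER BY city, car_model"""

 for (_ <- 1 to 10) spark.time(spark.sql(groupingSetsQuery).collect())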

RollUp and Cube

RollUp and Cube are the other two advanced GROUP BY features; both are shorthand for specific grouping sets:

  • ROLLUP(A, B, C) equals GROUPING SETS((A, B, C), (A, B), (A), ())
  • CUBE(A, B, C) equals GROUPING SETS((A, B, C), (A, B), (A, C), (B, C), (A), (B), (C), ())

Both are rewritten into Grouping Sets during query analysis, so they rely on the same Expand-based execution model and show the same performance characteristics.
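
As a quick sanity check of the first equivalence, a sketch assuming the dealer table and a SparkSession named spark (exceptAll compares the two results as multisets):

 // ROLLUP(city, car_model) should produce exactly the same rows as the
 // GROUPING SETS expansion ((city, car_model), (city), ()).
 val viaRollup = spark.sql(
   "SELECT city, car_model, sum(quantity) AS sum FROM dealer GROUP BY ROLLUP(city, car_model)")
 val viaSets = spark.sql(
   """SELECT city, car_model, sum(quantity) AS sum FROM dealer
      GROUP BY GROUPING SETS ((city, car_model), (city), ())""")

 assert(viaRollup.exceptAll(viaSets).isEmpty && viaSets.exceptAll(viaRollup).isEmpty)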

Conclusion

Though Grouping Sets can be emulated with UNION ALL, it offers superior expressiveness and performance. RollUp and Cube are syntactic shortcuts built upon Grouping Sets, leveraging the same efficient execution model.
