This approach leverages Spark's implicit conversinos to infer column names from case class attributes.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
.appName("RDDConversionExample")
.master("local[*]")
.getOrCreate()
import spark.implicits._
case class User(id: Int, username: String, score: Double)
val userRDD = spark.sparkContext.parallelize(Seq(
User(1, "Alex", 85.5),
User(2, "Beth", 92.3),
User(3, "Chris", 78.9)
))
val userDF = userRDD.toDF()
userDF.show()
Method 2: Using toDF with Explicit Column Names
When working with tuple-based RDDs, you can specify custom column names during conversion.
val productRDD = spark.sparkContext.parallelize(Seq(
("Laptop", 1200.00, 15),
("Phone", 899.99, 32),
("Tablet", 450.50, 27)
))
val productDF = productRDD.toDF("product_name", "price", "stock")
productDF.show()
Method 3: Using createDataFrame with Defined Schema
For complete control over column names and data types, create a schema and convert Row-based RDDs.
import org.apache.spark.sql.{Row, types}
val transactionRDD = spark.sparkContext.parallelize(Seq(
Row("2023-01-01", 1001, 149.99),
Row("2023-01-02", 1002, 299.50),
Row("2023-01-03", 1003, 99.95)
))
val transactionSchema = types.StructType(Array(
types.StructField("date", types.StringType, nullable = false),
types.StructField("order_id", types.IntegerType, nullable = false),
types.StructField("amount", types.DoubleType, nullable = false)
))
val transactionDF = spark.createDataFrame(transactionRDD, transactionSchema)
transactionDF.show()