[Spark SQL]: DataFrame schema resulting in NullPointerException

chitralverma
Hey,

I'm working on a use case that involves converting DStreams to DataFrames after some transformations. I've simplified my code to the following snippet to reproduce the error; my environment settings are listed below.

Environment:
Spark Version: 2.2.0
Java: 1.8
Execution mode: local / IntelliJ

Code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

object Tests {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = ...

    import spark.implicits._

    val df = List(
      ("jim", "usa"),
      ("raj", "india"))
      .toDF("name", "country")

    df.rdd
      .map(x => x.toSeq)
      .map(x => new GenericRowWithSchema(x.toArray, df.schema))
      .foreach(println)
  }
}

This results in a NullPointerException because I'm using df.schema directly inside map().
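To rule out the row construction itself, I also checked that GenericRowWithSchema is happy with a plain, hand-built StructType. This standalone snippet is just my own sanity check (no SparkSession involved):

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object RowCheck extends App {
  // Hand-built schema equivalent to the one toDF("name", "country") infers.
  val schema = StructType(Seq(
    StructField("name", StringType),
    StructField("country", StringType)))

  // Constructing the row outside any closure works without issues.
  val row: Row = new GenericRowWithSchema(Array[Any]("jim", "usa"), schema)
  println(row) // prints [jim,usa]
}

So the constructor itself seems fine; the failure appears tied specifically to referencing df.schema from inside the closure.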

What I don't understand is that if I use the following code (basically storing the schema in a value before the transformation), it works just fine.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

object Tests {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = ...

    import spark.implicits._

    val df = List(
      ("jim", "usa"),
      ("raj", "india"))
      .toDF("name", "country")
    val sc = df.schema

    df.rdd
      .map(x => x.toSeq)
      .map(x => new GenericRowWithSchema(x.toArray, sc))
      .foreach(println)
  }
}

I wonder why this is happening, as df.rdd is not an action and there is no visible change in the state of the DataFrame at that point. What are your thoughts on this?
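My current guess (unverified, so please correct me if I'm wrong) is that the failing lambda closes over df itself, the whole Dataset gets Java-serialized along with the task, and its transient internals come back as null on the task side, so calling df.schema there throws. The helper below is my own test utility, not a Spark API; it just round-trips a value through Java serialization, roughly the way a captured closure would be shipped:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object SerializationCheck {
  // Serializes a value and reads it back, which is roughly what happens
  // to everything a task closure captures when it is sent to an executor.
  def roundTrip[T <: AnyRef](value: T): T = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    in.readObject().asInstanceOf[T]
  }
}

If my theory holds, roundTrip(df.schema) should come back as a perfectly usable StructType, while calling .schema on roundTrip(df) should fail the same way, because the schema is derived from the Dataset's transient state.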

Regards,
Chitral Verma