Caching dataframes and overwrite

MidwestMike

I have been trying to find out why I am getting strange behavior when running a certain Spark job. Whether the job errors out depends on where I place an action (a .show(1) call): either right after caching the DataFrame or right before writing the DataFrame back to HDFS. There is a very similar Stack Overflow post here: Spark SQL SaveMode.Overwrite, getting java.io.FileNotFoundException and requiring 'REFRESH TABLE tableName'.

Basically, the other post explains that when you read from the same HDFS directory that you are writing to, and your SaveMode is "overwrite", you will get a java.io.FileNotFoundException. But here I am finding that simply moving where in the program the action sits gives very different results: either the program completes or it throws this exception. Can anyone explain why Spark is not being consistent here?

 val myDF = spark.read.format("csv")
    .option("header", "false")
    .option("delimiter", "\t")
    .schema(schema)
    .load(myPath)

// If I cache or persist it here and then run an action right after the cache, it will
// occasionally not throw the error. This is with a completely fresh SparkSession, so there
// is no risk of another user interfering on the same JVM.
myDF.cache()
myDF.show(1)
// The calls below are just meant to show that we're doing other DataFrame transformations;
// different transformations have all led to the weird behavior, so I'm not being specific
// about what exactly those transformations are.

val secondDF = mergeOtherDFsWithmyDF(myDF, otherDF, thirdDF)

val fourthDF = mergeTwoDFs(thirdDF, StringToCheck, fifthDF)

// Below is the same .show(1) action call as before, except this one ALWAYS results in a
// successful completion, while the .show(1) above sometimes results in FileNotFoundException
// and sometimes in a successful completion. The only thing that changes between test runs is
// which action is executed: either fourthDF.show(1) or myDF.show(1) is left commented out.
fourthDF.show(1)
fourthDF.write
    .mode(writeMode)
    .option("header", "false")
    .option("delimiter", "\t")
    .csv(myPath)
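
One workaround I have seen suggested for this pattern (not something from the job above) is to cut the lineage back to myPath before the overwrite, so that the write never has to re-read input files that the overwrite has already deleted. A rough sketch, assuming Spark 2.1+ and a writable HDFS checkpoint directory (the /tmp path below is just a placeholder):

// Set a scratch directory for checkpoint data (placeholder path).
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

// checkpoint() is eager by default: it writes the data out to the checkpoint
// directory and returns a DataFrame whose lineage no longer references myPath.
val checkpointedDF = fourthDF.checkpoint()

checkpointedDF.write
    .mode(writeMode)
    .option("header", "false")
    .option("delimiter", "\t")
    .csv(myPath)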