[Spark SQL]: Dataframe group by potential bug (Scala)
This is using Spark 2.4.4 with Scala. I'm getting some very strange behaviour
after reading in a dataframe from a json file, using sparkSession.read in
permissive mode. I've included the error column when reading in the data, as
I want to log details of any errors in the input json file.
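For context, the read looks roughly like this. This is a sketch, not my exact code: the file path and variable names are illustrative, but the options are the standard Spark ones for permissive reads with a corrupt-record column.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SalesReport").getOrCreate()

// PERMISSIVE is the default mode: malformed rows are not dropped,
// but captured in the corrupt-record column so they can be logged.
val rawSales = spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("sales.json") // hypothetical input path
```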
My suspicion is that I've found a bug in Spark, though I'm happy to be
wrong. I can't find any reference to this issue online.
The salesTotal is now correct. However, if I then process this dataframe
further, for example by dropping the allRecords column, or converting it to
a DataSet based on a simple case class, the salesTotals revert to the
incorrect values.
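The kinds of follow-up operations after which the totals go wrong look like this. A sketch only: report, allRecords, and the case class fields are assumed names, not verbatim from my job.

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

case class GameTotal(game: String, salesTotal: Long)

// At this point report has the correct salesTotal per game,
// but either of these further steps changed the totals:

// 1. Dropping the extra column
val dropped = report.drop("allRecords")

// 2. Converting to a typed Dataset via a simple case class
val typed = report.drop("allRecords").as[GameTotal]
```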
The only reliable way I've found to handle this is to process the allRecords
column via an explode, and then group the resulting records again.
In a single statement:
val allInOneReport = validSales
  .groupBy("game")
  .agg(sum(col("sales")) as "salesTotal")
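Spelled out over several steps, the explode-then-regroup workaround looks roughly like this. The nested field names (record.game, record.sales) are assumptions about my schema, adjust for yours:

```scala
import org.apache.spark.sql.functions._

// Flatten the collected records back out to one row per sale,
// then aggregate again. This was the only sequence that gave me
// stable, correct totals.
val validSales = report
  .select(explode(col("allRecords")) as "record")
  .select(col("record.game") as "game", col("record.sales") as "sales")

val allInOneReport = validSales
  .groupBy("game")
  .agg(sum(col("sales")) as "salesTotal")
```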