Spark doesn't add _SUCCESS file when 'partitionBy' is used



Eric Beabes
When I do the following, Spark (2.4) doesn't put a _SUCCESS file in the partition directory:

val outputPath = s"s3://mybucket/$table"
df
.orderBy("time")
.coalesce(numFiles)
.write
.partitionBy("partitionDate")
.mode("overwrite")
.format("parquet")
.save(outputPath)

But when I remove 'partitionBy' and add the partition info to the outputPath as shown below, I do see the _SUCCESS file.

Questions:
1) Is the following solution acceptable?
2) Would this cause problems elsewhere if I don't use the 'partitionBy' clause?
3) Is there a better way to ensure that a _SUCCESS file is created in each partition directory?

val outputPath = s"s3://mybucket/$table/date=<some date>"
df
.orderBy("time")
.coalesce(numFiles)
.write
.mode("overwrite")
.format("parquet")
.save(outputPath)
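For what it's worth, one way to keep a partitionBy-style layout while using the manual-path approach above is to loop over the partition values and issue one save() per partition directory; since each save() is its own job commit, each directory gets its own _SUCCESS marker. A minimal sketch follows — the names outputBase, dates, and the column "partitionDate" are assumptions, and the Spark write itself is shown as a comment because it needs a live SparkSession:

```scala
// Sketch: build one Hive-style output path per partition value, mirroring
// the manual "date=<value>" layout from the workaround above.
object PartitionPaths {
  // Hive-style partition directory for a given date value
  def partitionPath(base: String, date: String): String =
    s"$base/date=$date"

  def main(args: Array[String]): Unit = {
    val outputBase = "s3://mybucket/mytable"          // assumed bucket/table
    val dates      = Seq("2021-05-01", "2021-05-02")  // assumed partition values

    for (d <- dates) {
      val path = partitionPath(outputBase, d)
      println(path)
      // With a live session, each partition would be written separately, e.g.:
      // df.filter(col("partitionDate") === d)
      //   .write.mode("overwrite").format("parquet").save(path)
      // Each save() commits as its own job, so each directory gets a _SUCCESS file.
    }
  }
}
```

The trade-off is one Spark job per partition value instead of a single partitionBy job, which can be slower when there are many partitions.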