[pyspark 2.4.0] write with overwrite mode fails


Hi All,

df = spark.read.csv(PATH)
# replace only the partitions being written, not the whole output directory
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')
df.repartition('col1', 'col2').write.mode('overwrite').partitionBy('col1').parquet(OUT_PATH)

This works fine and overwrites the partitioned directory as expected.

However, the overwrite fails when the previous run was abruptly interrupted and the partition directory contains only a _started flag file, with no _SUCCESS or _committed file. In that case the second run does not overwrite the partition, leaving duplicate files in it. Could someone please help?


Rishi Shah