Overwrite only specific partition with hive dynamic partitioning

Nirav Patel
Hi,

I have a Hive partitioned table created using SparkSession. I would like to insert/overwrite DataFrame data into a specific set of partitions without losing any other partitions. In each run I have to update a set of partitions, not just one.

E.g. on the first run I have a DataFrame with bid=1, bid=2, bid=3, and I can write it using:

`df.write.mode(SaveMode.Overwrite).partitionBy("bid").parquet(TableBaseLocation)`


It generates the directories bid=1, bid=2, bid=3 inside TableBaseLocation.

But the next time, when I have a DataFrame with bid=1 and bid=4 and use the same code as above, it removes bid=2 and bid=3. In other words, I don't get idempotency.

I tried SaveMode.Append, but that creates duplicate data inside "bid=1".


I read 

With that approach it seems like I may have to update multiple partitions manually for each input partition. That seems like a lot of work on every update. Is there a better way to do this?
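
To make the manual route concrete, here is a rough sketch of what updating each input partition by hand might look like. The helper name `overwritePartitionsManually` is hypothetical, and the sketch assumes the `bid` column is never null and that writing straight into the `bid=<value>` directory is acceptable for this table:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.col

// Hypothetical helper: overwrite only the partition directories that
// actually occur in `df`, leaving every other bid=... directory untouched.
def overwritePartitionsManually(df: DataFrame, tableBaseLocation: String): Unit = {
  // Collect the distinct partition values present in this batch
  // (assumes there are few of them and none are null).
  val bids = df.select("bid").distinct().collect().map(_.get(0))

  bids.foreach { bid =>
    df.filter(col("bid") === bid)
      .drop("bid") // the value is already encoded in the directory name
      .write
      .mode(SaveMode.Overwrite) // replaces only this one directory
      .parquet(s"$tableBaseLocation/bid=$bid")
  }
}
```

This runs one write job per partition value, which is the per-update overhead I would like to avoid.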

Can this fix be applied to the DataFrame-based approach as well?
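
For reference, one option that may apply here, assuming Spark 2.3 or later, is the `spark.sql.sources.partitionOverwriteMode` setting. With it set to `dynamic`, an overwrite should replace only the partitions present in the DataFrame being written. The sketch below is untested against my setup, and the helper name `overwriteTouchedPartitions` is hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Hypothetical helper, assuming Spark 2.3+: same write call as before,
// but with dynamic partition overwrite enabled for the session.
def overwriteTouchedPartitions(
    spark: SparkSession,
    df: DataFrame,
    tableBaseLocation: String): Unit = {
  // The default is "static", which clears all existing partitions under
  // the target path before writing. "dynamic" limits the overwrite to
  // partitions that receive data in this write.
  spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

  df.write
    .mode(SaveMode.Overwrite)
    .partitionBy("bid")
    .parquet(tableBaseLocation)
}
```

If this behaves as documented, writing a DataFrame with bid=1 and bid=4 would replace bid=1, add bid=4, and leave bid=2 and bid=3 in place.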

Thanks


