Cleanup hook for temporary files produced as part of a Spark job


jelmer
I am writing something that partitions a data set and then trains a machine learning model on the data in each partition.

The resulting model is very big, and right now I am storing it in an RDD as a pair of:
partition_id and very_big_model_that_is_hundreds_of_megabytes_big

But it is becoming increasingly apparent that storing data that big in a single row of an RDD causes all sorts of complications.
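Roughly, the current setup looks like the sketch below (spark-shell style, so sc already exists; the model type and the training step are just illustrative stand-ins):

import org.apache.spark.HashPartitioner

// Stand-in for the real model, which is hundreds of megabytes in practice.
case class BigModel(partitionId: Int, weights: Array[Double])

// Stand-in for the real per-partition training step.
def trainModel(pid: Int, rows: Iterator[Double]): BigModel =
  BigModel(pid, rows.toArray)

val data = sc.parallelize(Seq((0, 1.0), (0, 2.0), (1, 3.0), (1, 4.0)))

// One (partition_id, very_big_model) pair per partition: this is the
// single RDD row that gets problematically large.
val models = data
  .partitionBy(new HashPartitioner(2))
  .mapPartitionsWithIndex { (pid, rows) =>
    Iterator((pid, trainModel(pid, rows.map(_._2))))
  }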

So I figured that instead I could save each model to the filesystem and store a pointer to the model (a file path) in the RDD. Then I would simply load the model again in a mapPartitions function and avoid the issue.
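Something along these lines is what I have in mind, building on the sketch above (the /tmp path and plain Java serialization are placeholders; in reality the files would go to a shared filesystem like HDFS or S3 so that every executor can read them back):

import java.io.{File, FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

def saveModel(pid: Int, model: BigModel): String = {
  new File("/tmp/models").mkdirs()
  val path = s"/tmp/models/model-$pid.bin"
  val out = new ObjectOutputStream(new FileOutputStream(path))
  try out.writeObject(model) finally out.close()
  path
}

def loadModel(path: String): BigModel = {
  val in = new ObjectInputStream(new FileInputStream(path))
  try in.readObject().asInstanceOf[BigModel] finally in.close()
}

// The RDD now only carries small (partition_id, path) pointers ...
val modelPaths = data
  .partitionBy(new HashPartitioner(2))
  .mapPartitionsWithIndex { (pid, rows) =>
    Iterator((pid, saveModel(pid, trainModel(pid, rows.map(_._2)))))
  }

// ... and the big model is only materialised inside mapPartitions when it is
// actually needed (the .sum here is just a stand-in for real scoring).
val scored = modelPaths.mapPartitions { iter =>
  iter.map { case (pid, path) => (pid, loadModel(path).weights.sum) }
}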

But this raises the question of when to clean up these temporary files. Is there some way to ensure that files written by Spark code get cleaned up when the SparkSession ends or the RDD is no longer referenced?
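To make it concrete, the kind of hook I am thinking of would look roughly like this (sketch only, reusing the illustrative /tmp/models path from above and assuming the driver can reach those files; I am not sure a listener is the intended mechanism, and it would not cover the case where only the RDD goes out of scope):

import java.io.File
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

// Naive recursive delete of the temporary model directory.
def deleteRecursively(f: File): Unit = {
  if (f.isDirectory) Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
  f.delete()
}

// Run the cleanup on the driver when the application ends.
sc.addSparkListener(new SparkListener {
  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit =
    deleteRecursively(new File("/tmp/models"))
})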

Or is there any other solution to this problem?