Cleanup hook for temporary files produced as part of a Spark job
I am writing something that partitions a data set and then trains a machine learning model on the data in each partition.
The resulting model is very big, and right now I am storing it in an RDD as a pair of (partition_id, very_big_model_that_is_hundreds_of_megabytes_big),
but it is becoming increasingly apparent that storing data that big in a single row of an RDD causes all sorts of complications.
So I figured that instead I could save the model to the filesystem and store a pointer to it (the file path) in the RDD. Then I would simply load the model again in a mapPartitions function and avoid the issue.
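To make the idea concrete, here is a minimal sketch of the save-path-instead-of-model pattern. This is plain Python simulating partitions with a dict (no actual Spark involved); `train_model`, `save_model`, and `load_model` are hypothetical helper names, not part of any library:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for training; in reality this would produce the
# hundreds-of-megabytes model described above.
def train_model(rows):
    return {"weights": [sum(rows)]}

# Directory holding one saved model per partition.
model_dir = tempfile.mkdtemp(prefix="partition_models_")

def save_model(partition_id, model):
    # Serialize the model to disk and return only its path.
    path = os.path.join(model_dir, f"model_{partition_id}.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path

def load_model(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Simulated partitions; in Spark these would be RDD partitions.
partitions = {0: [1, 2, 3], 1: [4, 5, 6]}

# The "RDD" now holds (partition_id, path) pairs instead of the model itself.
paths = {pid: save_model(pid, train_model(rows)) for pid, rows in partitions.items()}

# Later, inside something like mapPartitions, reload each model from its path.
reloaded = {pid: load_model(path) for pid, path in paths.items()}
```

In a real job, `save_model` would write to a filesystem visible to all executors (e.g. HDFS or S3 rather than a local temp dir), since the mapPartitions task that reloads the model may run on a different machine.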
But that raises the question of when to clean up these temporary files. Is there some way to ensure that files written out by Spark code get cleaned up when the SparkSession ends, or when the RDD is no longer referenced?
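For the session-end half of the question, one workaround (an assumption on my part, not a built-in Spark feature) is a process-exit hook on the driver, since the SparkSession goes away when the driver process does. A minimal Python sketch, where `model_dir` is the hypothetical temp directory holding the saved models:

```python
import atexit
import os
import shutil
import tempfile

# Hypothetical temp directory holding the saved per-partition models.
model_dir = tempfile.mkdtemp(prefix="partition_models_")

def cleanup(path=model_dir):
    # Remove the whole directory tree; ignore_errors so a half-cleaned
    # directory does not raise during interpreter shutdown.
    shutil.rmtree(path, ignore_errors=True)

# Runs when the driver's Python process exits, which is also when the
# SparkSession disappears.
atexit.register(cleanup)
```

The Scala/JVM equivalent would be `sys.addShutdownHook` (or `Runtime.getRuntime().addShutdownHook`). Note this only covers orderly session shutdown; it says nothing about the "RDD is no longer referenced" case, which would need something else entirely.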