Is there any possibility to avoid double computation in case of RDD checkpointing

Ivan Petrov
Hi!
I use an RDD checkpoint before writing to Mongo to avoid duplicate records in the DB. It seems the driver writes the same data twice when a task fails:
- the data is calculated
- a Mongo _id is created for each record
- the Spark Mongo connector writes the data to Mongo
- a task crashes
- (BOOM!) Spark recomputes the partition and generates new _ids for the Mongo records
- I get duplicate records in Mongo
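The root cause in the steps above is that the _id is generated at compute time, so a recomputed partition produces different ids and the retry cannot be deduplicated. A minimal sketch of the contrast in plain Python (not Spark; `record_payload`, `run_partition`, and the dict standing in for the Mongo collection are all made-up names):

```python
import hashlib
import uuid

def record_payload(key):
    # Stand-in for the real per-record computation (hypothetical).
    return {"key": key, "value": key * 2}

def random_id(payload):
    # Like a fresh ObjectId: different every time the partition runs.
    return uuid.uuid4().hex

def deterministic_id(payload):
    # Derived from the record's content: stable across recomputations.
    raw = f"{payload['key']}:{payload['value']}".encode()
    return hashlib.sha1(raw).hexdigest()

def run_partition(make_id):
    # Simulates Spark computing (or recomputing) a partition of 3 records.
    return {make_id(p): p for p in (record_payload(k) for k in range(3))}

store = {}                                # stand-in for the Mongo collection
store.update(run_partition(random_id))    # first attempt
store.update(run_partition(random_id))    # retry after a "task crash"
print(len(store))                         # 6: every record duplicated

store = {}
store.update(run_partition(deterministic_id))
store.update(run_partition(deterministic_id))  # retry is idempotent
print(len(store))                              # 3: no duplicates
```

If the _id can be derived from the record's content (or a stable key) instead of drawn at random, retries upsert the same ids and the duplicate problem disappears without any checkpoint at all.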

So I've added a checkpoint before writing to Mongo.
Now Spark's execution runtime has doubled because of the checkpoint.
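The doubling happens because checkpoint() launches a separate job that re-runs the whole lineage unless the RDD was persisted first; the Spark API docs recommend persisting before checkpointing (roughly `rdd.persist()` then `rdd.checkpoint()`). The effect can be sketched in plain Python, with a lazy generator pipeline standing in for an un-persisted RDD:

```python
calls = {"n": 0}

def expensive(x):
    # Stand-in for the real per-record computation; counts invocations.
    calls["n"] += 1
    return x * x

def lineage():
    # Lazy pipeline: like an un-persisted RDD, it recomputes when re-run.
    return (expensive(x) for x in range(5))

# Without persisting: the checkpoint job and the write job each recompute.
checkpoint_copy = list(lineage())   # "checkpoint" job
written = list(lineage())           # "write to Mongo" job
print(calls["n"])                   # 10: everything computed twice

# With persisting: compute once, reuse the materialized data for both.
calls["n"] = 0
cached = list(lineage())            # like rdd.persist() + first action
checkpoint_copy = list(cached)      # checkpoint reads the cache
written = list(cached)              # write reads the cache
print(calls["n"])                   # 5: computed once
```

In Spark terms, calling persist() (e.g. MEMORY_AND_DISK) before checkpoint() should let the checkpoint job read cached partitions instead of recomputing the lineage, which is the usual fix for the doubled runtime.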
What is the right way to avoid this? I'm thinking of saving the data to HDFS, then reading it back and writing it to Mongo, instead of using a checkpoint...
Is that a viable idea?
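The stage-to-HDFS idea is viable in principle: once the records (ids included) are committed to durable storage, the Mongo load step can be retried safely, because re-reading the staged files yields the same ids every time. A rough sketch of the pattern using a local file in place of HDFS (`stage_path`, `stage`, and `load_to_mongo` are made-up names, and a dict stands in for the Mongo collection):

```python
import json
import os
import tempfile

def compute_records():
    # Stand-in for the expensive Spark computation; ids are fixed here, once.
    return [{"_id": f"id-{k}", "value": k * 2} for k in range(3)]

def stage(records, path):
    # Like saveAsTextFile to HDFS: persist the records, ids included.
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")

def load_to_mongo(path, collection):
    # Retry-safe load: upsert by the staged _id, so reruns don't duplicate.
    with open(path) as f:
        for line in f:
            r = json.loads(line)
            collection[r["_id"]] = r

stage_path = os.path.join(tempfile.mkdtemp(), "staged.jsonl")
stage(compute_records(), stage_path)

mongo = {}                        # stand-in for the Mongo collection
load_to_mongo(stage_path, mongo)
load_to_mongo(stage_path, mongo)  # simulated retry after a failure
print(len(mongo))                 # 3: no duplicates
```

The key design point is that the load step upserts by the staged _id rather than inserting, so a crashed-and-retried write job converges to the same final state.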