Difference between Checkpointing and Persist

Subash Prabakar
Hi All,

I have a question about checkpointing versus persisting/saving.

Say we have one RDD containing a huge amount of data, and we do one of the following:
1. We checkpoint it and perform a join
2. We persist it with StorageLevel.MEMORY_AND_DISK and perform a join
3. We save that intermediate RDD and perform a join (using the same RDD; the save is just to persist the intermediate result before joining)


Which of the above is faster, and what's the difference?


Thanks,
Subash

Re: Difference between Checkpointing and Persist

Jack Kolokasis
Hi,

     In my view, a good approach is to first persist your data with
StorageLevel.MEMORY_AND_DISK and then perform the join. This can speed up
your computation because the data will be held in memory, with whatever
does not fit spilled to each executor's local disk.

--Iacovos

On 4/18/19 8:49 PM, Subash Prabakar wrote:

> Hi All,
>
> I have a doubt about checkpointing and persist/saving.
>
> Say we have one RDD - containing huge data,
> 1. We checkpoint and perform join
> 2. We persist as StorageLevel.MEMORY_AND_DISK and perform join
> 3. We save that intermediate RDD and perform join (using same RDD -
> saving is to just persist intermediate result before joining)
>
>
> Which of the above is faster and whats the difference?
>
>
> Thanks,
> Subash

Re: Difference between Checkpointing and Persist

Vadim Semenov-2
In reply to this post by Subash Prabakar
Saving/checkpointing would be preferable for a big data set because:

- The RDD gets saved to HDFS and the DAG gets truncated, so if some partitions or executors fail, Spark won't have to recompute everything.

- You don't use memory for caching, so the JVM heap is smaller, which helps GC and leaves more memory for other operations.

- By saving to HDFS you remove potential hotspots, since partitions can be fetched from many DataNodes. With caching, a hot partition that many other executors request can leave you with an overwhelmed executor.
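
The first point, lineage truncation, can be illustrated with a small toy model in plain Python. This is not Spark code: `ToyRDD`, `lose_executor`, and the stage names are hypothetical and exist only to show why a checkpointed dataset does not need its upstream stages re-run after an executor is lost, while a merely cached one does.

```python
class ToyRDD:
    """Toy stand-in for an RDD: data derived from a parent by a function."""

    def __init__(self, name, fn, parent=None):
        self.name, self.fn, self.parent = name, fn, parent
        self.cached = None        # executor-local copy; lost if the executor dies
        self.checkpointed = None  # copy in reliable storage; survives executor loss

    def compute(self, log):
        if self.checkpointed is not None:
            return self.checkpointed
        if self.cached is not None:
            return self.cached
        source = self.parent.compute(log) if self.parent else None
        log.append(self.name)     # record that this stage actually ran
        return self.fn(source)

    def persist(self, log):
        self.cached = self.compute(log)

    def checkpoint(self, log):
        self.checkpointed = self.compute(log)
        self.parent = None        # lineage is truncated at this point

    def lose_executor(self):
        self.cached = None        # the checkpointed copy is untouched


def build():
    load = ToyRDD("load", lambda _: [1, 2, 3])
    return ToyRDD("clean", lambda xs: [x * 10 for x in xs], load)


# Persist, lose the executor, then ask for the data again.
persisted = build()
persisted.persist([])
persisted.lose_executor()
recomputed_after_persist = []
result_persist = persisted.compute(recomputed_after_persist)

# Checkpoint, lose the executor, then ask for the data again.
checkpointed = build()
checkpointed.checkpoint([])
checkpointed.lose_executor()
recomputed_after_checkpoint = []
result_checkpoint = checkpointed.compute(recomputed_after_checkpoint)
```

In this model the persisted dataset has to re-run both `load` and `clean` after the loss, while the checkpointed one re-runs nothing. Real Spark recomputes only the lost partitions, so the model exaggerates the cost, but the direction of the difference is the same.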

> We save that intermediate RDD and perform join (using same RDD - saving is to just persist intermediate result before joining)
Checkpointing is essentially saving the RDD and reading it back. However, you can't read checkpointed data back if the job failed, so it'd be nice to have one side of the join saved explicitly in case of issues.

Overall, in my opinion, when working with big joins you should pay more attention to reliability and fault tolerance than to pure speed, because the probability of running into issues grows with dataset size and cluster size.


Re: Difference between Checkpointing and Persist

gene.pang
In reply to this post by Subash Prabakar
Hi Subash,

I'm not sure how checkpointing works, but with StorageLevel.MEMORY_AND_DISK, Spark will store the RDD in on-heap memory and spill to disk if necessary. However, the data is only usable within that Spark application. Saving the RDD writes the data out to an external storage system, like HDFS or Alluxio.

There are some advantages to saving the RDD, mainly that different jobs or even different frameworks can read that data. One possibility is to save the RDD to Alluxio, which can store the data in memory, improving throughput by avoiding the disk. Here is an article discussing different ways to store RDDs

Thanks,
Gene
