Appropriate checkpoint interval in a spark streaming application

Appropriate checkpoint interval in a spark streaming application

sheelstera
Hello, 

I am trying to figure out an appropriate checkpoint interval for my Spark Streaming application. It's a Spark-Kafka integration based on Direct Streams. 

If my microbatch interval is 2 minutes, and let's say each microbatch takes only 15 seconds to process, then shouldn't my checkpoint interval also be exactly 2 minutes? 

Assuming my Spark Streaming application starts at t=0, the following will be the state of my checkpoints:

Case 1: checkpoint interval is less than microbatch interval
If I keep my checkpoint interval at, say, 1 minute, then:
t=1m: no incomplete batches in this checkpoint 
t=2m: first microbatch is included as an incomplete microbatch in the checkpoint and microbatch execution then begins
t=3m: no incomplete batches in the checkpoint as the first microbatch is finished processing in just 15 secs
t=4m: second microbatch is included as an incomplete microbatch in the checkpoint and microbatch execution then begins
t=4m30s: system breaks down; on restart, the streaming application finds the checkpoint at t=4m with the second microbatch marked incomplete and processes it again. But what's the point of reprocessing it, since the second microbatch's processing was already completed at t=4m15s?

Case 2: checkpoint interval is more than microbatch interval
If I keep my checkpoint interval at, say, 4 minutes, then:
t=2m: first microbatch execution begins
t=4m: first checkpoint, with the second microbatch included as the only incomplete batch; second microbatch processing begins
Sub-case 1: system breaks down at t=2m30s. The first microbatch execution was completed at t=2m15s, but there is no checkpoint information about this microbatch since the first checkpoint only happens at t=4m. Consequently, when the streaming app restarts, it will re-execute the batch by fetching the offsets from Kafka. 
Sub-case 2: system breaks down at t=5m. The second microbatch was already completed in 15 seconds, i.e. at t=4m15s, which means at t=5m there should ideally be no incomplete batches. When I restart my application, though, it finds the second microbatch marked incomplete in the checkpoint made at t=4m and re-executes that microbatch. 
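The two cases above can be sketched with a small toy simulation. This is purely an illustration of the reasoning, not Spark code: it assumes (as the timelines above do) that a checkpoint records every batch that has been generated but not yet finished at checkpoint time, and that recovery replays those batches plus anything generated after the last checkpoint (recovered from Kafka offsets). The function names are hypothetical.

```python
# Toy model (NOT Spark code): which microbatches get re-executed after a
# crash, for a given batch interval, per-batch processing time, checkpoint
# interval, and crash time (all in minutes)?

def _multiples(step, limit):
    """All positive multiples of `step` up to and including `limit`."""
    out, t = [], step
    while t <= limit:
        out.append(t)
        t += step
    return out

def replayed_batches(batch_interval, proc_time, cp_interval, crash_time):
    """Start times of the batches replayed on restart, under the model that
    a checkpoint lists batches generated but not yet completed at checkpoint
    time, and recovery replays those plus any batch generated after the last
    checkpoint (re-fetched via Kafka offsets)."""
    batch_starts = _multiples(batch_interval, crash_time)
    checkpoints = _multiples(cp_interval, crash_time)
    if not checkpoints:
        # No checkpoint written yet: everything comes back from Kafka offsets.
        return batch_starts
    last_cp = checkpoints[-1]
    incomplete = [s for s in batch_starts if s <= last_cp < s + proc_time]
    after_cp = [s for s in batch_starts if s > last_cp]
    return incomplete + after_cp

# Case 1: 1-min checkpoints, crash at t=4m30s -> batch started at t=4m replays
print(replayed_batches(2, 0.25, 1, 4.5))   # [4]
# Case 2, sub-case 1: crash at t=2m30s, before any checkpoint -> batch at t=2m
print(replayed_batches(2, 0.25, 4, 2.5))   # [2]
# Case 2, sub-case 2: crash at t=5m -> batch at t=4m replays despite finishing
print(replayed_batches(2, 0.25, 4, 5))     # [4]
```

Under this model, some reprocessing on recovery is unavoidable whichever interval is chosen, which is why the Direct Stream approach pairs checkpointed offsets with idempotent or transactional output to achieve effectively-once results.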

Is my understanding right? If yes, then isn't my checkpoint interval incorrectly set, resulting in duplicate processing in both cases above? And if so, how do I choose an appropriate checkpoint interval? 

Regards,
Sheel
Re: Appropriate checkpoint interval in a spark streaming application

sheelstera
Guys, any inputs explaining the rationale behind the question below would really help. Requesting some expert opinion. 

Regards, 
Sheel

On Sat, 15 Aug, 2020, 1:47 PM Sheel Pancholi, <[hidden email]> wrote: