how to choose right DStream batch interval

qihong
I have some questions regarding the DStream batch interval:

1. If it takes only 0.5 seconds to process a batch 99% of the time, but 1% of batches need 5 seconds (due to some random factor or failures), what is the right batch interval? 5 seconds (the worst case)?

2. What happens to DStream processing if one batch takes longer than the batch interval? Can Spark recover from that?

Thanks,
Qihong

Re: how to choose right DStream batch interval

Mayur Rustagi
Spark will simply build up a backlog of tasks, and it will still manage to process them. If it keeps falling behind, though, you may run out of memory or see unreasonable latency. For momentary spikes, Spark Streaming will manage.
If you need 100% of batches to finish within the batch interval, you'll have to go with a 5-second interval. An alternative is to process the data in two pipelines (0.5-second and 5-second intervals) in two Spark Streaming jobs and overwrite the results of one with the other.

Mayur Rustagi
Ph: +1 (760) 203 3257
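
The backlog behaviour described in this reply can be illustrated with a small simulation. This is only a sketch in plain Python, not Spark code: the function name `scheduling_delays` is hypothetical, and it assumes batches are processed strictly one at a time and in order, which is a simplification of what a real Spark Streaming job does.

```python
def scheduling_delays(batch_interval, processing_times):
    """Return how long each batch waits before its processing starts.

    Assumption (simplified model): batch i becomes ready at
    (i + 1) * batch_interval and starts as soon as it is ready AND the
    previous batch has finished, so batches never overlap.
    """
    delays = []
    previous_finish = 0.0
    for i, processing_time in enumerate(processing_times):
        ready_at = (i + 1) * batch_interval
        start = max(ready_at, previous_finish)
        delays.append(start - ready_at)
        previous_finish = start + processing_time
    return delays


# Momentary spike: one 5-second batch among 0.5-second batches,
# with a 1-second batch interval.
spike = [0.5, 0.5, 5.0] + [0.5] * 10
spike_delays = scheduling_delays(1.0, spike)

# Sustained overload: every batch takes 2 seconds with a 1-second interval.
overload_delays = scheduling_delays(1.0, [2.0] * 4)

print(spike_delays)
print(overload_delays)
```

In this model the spike leaves the next batch 4 seconds behind, and each following 0.5-second batch wins back 0.5 seconds until the delay returns to zero. Under sustained overload the delay grows with every batch, which corresponds to the "keeps falling behind" case mentioned above.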

In reply to qihong's message of Sat, Sep 6, 2014 at 12:39 AM.


Re: how to choose right DStream batch interval

qihong
Hi Mayur,

Thanks for your response. I wrote a simple test that sets up a DStream with
5 batches. The batch duration is 1 second, and the 3rd batch takes an extra
2 seconds. The output of the test shows that the 3rd batch causes a backlog,
and that Spark Streaming does catch up on the 4th and 5th batches
(DStream.print was modified to output the system time):

-------------------------------------------
Time: 1409959708000 ms, system time: 1409959708269
-------------------------------------------
1155
-------------------------------------------
Time: 1409959709000 ms, system time: 1409959709033
-------------------------------------------
2255
delay 2000 ms
-------------------------------------------
Time: 1409959710000 ms, system time: 1409959712036
-------------------------------------------
3355
-------------------------------------------
Time: 1409959711000 ms, system time: 1409959712059
-------------------------------------------
4455
-------------------------------------------
Time: 1409959712000 ms, system time: 1409959712083
-------------------------------------------
5555

Thanks!
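
The catch-up in the output above can be made explicit by subtracting each batch time from the system time printed next to it. The snippet below is a minimal sketch in plain Python; the helper name `print_lags_ms` is hypothetical, and the only input is the five header lines copied from the test output.

```python
import re

# Header lines copied verbatim from the test output in the post.
LOG = """\
Time: 1409959708000 ms, system time: 1409959708269
Time: 1409959709000 ms, system time: 1409959709033
Time: 1409959710000 ms, system time: 1409959712036
Time: 1409959711000 ms, system time: 1409959712059
Time: 1409959712000 ms, system time: 1409959712083
"""

PATTERN = re.compile(r"Time: (\d+) ms, system time: (\d+)")


def print_lags_ms(log_text):
    """Return, per batch, system time at print minus batch time (in ms)."""
    return [int(system) - int(batch)
            for batch, system in PATTERN.findall(log_text)]


lags = print_lags_ms(LOG)
print(lags)  # [269, 33, 2036, 1059, 83]
```

The lag jumps to about 2 seconds on the 3rd batch, drops to about 1 second on the 4th, and is back under 100 ms by the 5th, which matches the description of a temporary backlog that Spark Streaming works off.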

Re: how to choose right DStream batch interval

Tim Smith

In reply to qihong's message of Tue, Sep 9, 2014 at 9:23 PM.