# Q: Discretized Streams: Fault-Tolerant Streaming Computation paper

4 messages

## Q: Discretized Streams: Fault-Tolerant Streaming Computation paper

I have a question on the paper "Discretized Streams: Fault-Tolerant Streaming Computation at Scale" by Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica, available at http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf

Specifically, I'm interested in Section 3.2 on page 5, "Timing Considerations", which discusses external timestamps. I'm looking to use method 2 and correct for late records at the application level. The paper says:

> "[The application] could output a new count for time interval [t, t+1) at time t+5, based on the records for this interval received between t and t+5. This computation can be performed with an efficient incremental reduce operation that adds the old counts computed at t+1 to the counts of new records since then, avoiding wasted work."

Q1: If my data comes 24 hours late on the same DStream (i.e., t = t0 + 24 hr) and I'm recording per-minute aggregates, wouldn't the RDD holding the data from 24 hours ago already have been deleted from disk by Spark? (I'd hope so; otherwise it would run out of space.)

Q2: The paper talks about an "incremental reduce". I'd like to know what it is. I already use reduce to get an aggregate of counts. What is this incremental reduce?

Thanks
-A
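As a sketch of what the quoted passage describes, here is the application-level correction in plain Scala. The object and method names are hypothetical (this is not the Spark API): a running count is kept per interval, and when late records arrive, only their contribution is added to the previously computed count rather than recounting the whole interval.

```scala
import scala.collection.mutable

// Hypothetical sketch of application-level correction for late records,
// in plain Scala rather than the Spark API.
object LateRecordCorrection {
  // Running count per interval start time; unseen intervals count as 0.
  val counts = mutable.Map.empty[Long, Long].withDefaultValue(0L)

  // "Incremental reduce": add only the newly arrived records' contribution
  // to the old count, instead of recounting every record in the interval.
  def addRecords(intervalStart: Long, newRecordCount: Long): Long = {
    counts(intervalStart) += newRecordCount // old count + count of new records
    counts(intervalStart)
  }
}
```

For example, if the count output at t+1 for interval [t, t+1) was 10 and 3 late records arrive by t+5, `addRecords` returns the corrected count 13 without touching the original 10 records.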

## Re: Q: Discretized Streams: Fault-Tolerant Streaming Computation paper

It seems StreamingContext has a function:

```scala
def remember(duration: Duration) {
  graph.remember(duration)
}
```

and in my opinion, incremental reduce means:

```
data: 1 2 3 4 5 6
window_size = 5
sum_of_first_window           = 1+2+3+4+5 = 15
sum_of_second_window_method_1 = 2+3+4+5+6 = 20
sum_of_second_window_method_2 = 15+6-1    = 20
```

On Wed, Feb 19, 2014 at 3:45 PM, Adrian Mocanu wrote:

--
Dachuan Huang
Cellphone: 614-390-7234
2015 Neil Avenue
Ohio State University
Columbus, Ohio, U.S.A. 43210
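The reply's window arithmetic can be spelled out in plain Scala (again, illustrative names, not the DStream API):

```scala
// Sliding-window sums over the reply's example data, two ways.
object IncrementalWindowSum {
  val data = Vector(1, 2, 3, 4, 5, 6)

  // Method 1: recompute the whole window from scratch on every slide.
  def fullWindowSum(start: Int, size: Int): Int =
    data.slice(start, start + size).sum

  // Method 2 ("incremental reduce"): old sum, plus the element entering
  // the window, minus the element leaving it.
  def slideSum(oldSum: Int, entering: Int, leaving: Int): Int =
    oldSum + entering - leaving
}
```

So `fullWindowSum(0, 5)` gives 15, and sliding by one gives 20 either way: `fullWindowSum(1, 5)` recomputes five additions, while `slideSum(15, 6, 1)` needs only two operations regardless of window size. In Spark Streaming itself, this is what the `reduceByKeyAndWindow` variant that takes an inverse reduce function does, if I understand it correctly.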