Spark Streaming with long batch / window duration

aaronjosephs
Would it be a reasonable use case for Spark Streaming to have a very large window size (let's say on the scale of weeks)? In this particular case the reduce function would be invertible, which should help efficiency. I assume that using a larger batch size, since the window is so large, would also lighten the workload for Spark. The slide duration is not too important; I just want to know whether this is reasonable for Spark to handle with any slide duration.
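
To illustrate why an invertible reduce function helps, here is a minimal plain-Python sketch (not Spark code) of the idea behind reduceByKeyAndWindow with an inverse function: each slide adds the batch that entered the window and subtracts the batch that left it. The function name `sliding_window_sums` and the sample numbers are made up for illustration.

```python
from collections import deque


def sliding_window_sums(batches, window_len, add, subtract, zero):
    """Yield the reduced value of the last `window_len` batches after each batch.

    Each step folds in the newest batch and removes the batch that just
    fell out of the window, so the work per slide does not grow with
    the window length.
    """
    in_window = deque()  # per-batch reduced values still inside the window
    total = zero
    for batch_value in batches:
        in_window.append(batch_value)
        total = add(total, batch_value)
        if len(in_window) > window_len:
            # Inverse reduce: take out the batch that left the window.
            total = subtract(total, in_window.popleft())
        yield total


# Hypothetical per-batch counts and a window of 3 batches.
batch_counts = [3, 1, 4, 1, 5, 9]
result = list(
    sliding_window_sums(
        batch_counts, 3, lambda a, b: a + b, lambda a, b: a - b, 0
    )
)
print(result)  # [3, 4, 8, 6, 10, 15]
```

Note that even in this sketch the per-batch values have to be kept for the full window length so that they can be subtracted later, which is the part that grows when the window spans weeks.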

Re: Spark Streaming with long batch / window duration

Tathagata Das
If you want to process data that spans weeks, then it is best to use a dedicated data store (file system, SQL / NoSQL database, etc.) that is designed for long-term data storage and retrieval. Spark Streaming is not designed as a long-term data store. Also, it does not seem like you need low latency. So it might be better to use a combination of Spark Streaming and Spark programs: Spark Streaming to receive the data and store it in some long-term data store, and Spark to periodically (every hour, day?) pull the data from the store and process it. You can implement the invertible function yourself in Spark by storing the previous "reduced values" in the same data store every time the Spark program is run, and then using that data the next time.

The great thing is that both of these programs can share all the map and reduce functions.
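
The incremental approach described above (keep the previous "reduced values" in the store and update them on each run) could be sketched roughly as follows. This is plain Python rather than a Spark program; a local JSON file stands in for the long-term data store, and the name `run_incremental_job` and the sample records are hypothetical.

```python
import json
import os
import tempfile
from collections import Counter


def run_incremental_job(state_path, new_records, expired_records):
    """Update stored per-key counts instead of recomputing the whole window."""
    # Load the reduced values saved by the previous run (empty on the first run).
    if os.path.exists(state_path):
        with open(state_path) as f:
            counts = Counter(json.load(f))
    else:
        counts = Counter()

    counts.update(new_records)        # reduce: add records that entered the window
    counts.subtract(expired_records)  # inverse reduce: remove records that left it

    # Drop keys that no longer have any records in the window.
    counts = Counter({key: n for key, n in counts.items() if n > 0})

    # Save the reduced values for the next run.
    with open(state_path, "w") as f:
        json.dump(counts, f)
    return dict(counts)


# Two consecutive runs; the second one expires a record seen by the first.
state_file = os.path.join(tempfile.mkdtemp(), "reduced_values.json")
first = run_incremental_job(state_file, ["a", "b", "a"], [])
second = run_incremental_job(state_file, ["b", "c"], ["a"])
print(first)   # {'a': 2, 'b': 1}
print(second)  # {'a': 1, 'b': 2, 'c': 1}
```

Each run only touches the data that entered or left the window since the last run, so the cost depends on the run interval rather than on the window length.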

TD


On Fri, Jul 18, 2014 at 12:09 PM, aaronjosephs <[hidden email]> wrote:
Would it be a reasonable use case of spark streaming to have a very large
window size (lets say on the scale of weeks). In this particular case the
reduce function would be invertible so that would aid in efficiency. I
assume that having a larger batch size since the window is so large would
also lighten the workload for spark. The sliding duration is not too
important, I just want to know if this is reasonable for spark to handle
with any slide duration




Re: Spark Streaming with long batch / window duration

aaronjosephs
Unfortunately, for reasons I won't go into, my options for what I can use are limited. It was more of a curiosity to see whether Spark could handle a use case like this, since the functionality I wanted fit perfectly into the reduceByKeyAndWindow way of thinking. Anyway, thanks for answering.

Re: Spark Streaming with long batch / window duration

aaronjosephs
In reply to this post by Tathagata Das
So I think I may end up using Hourglass (https://engineering.linkedin.com/datafu/datafus-hourglass-incremental-data-processing-hadoop), a Hadoop framework for incremental data processing. It would be very cool if Spark (not Streaming) could support something like this.

Re: Spark Streaming with long batch / window duration

emceemouli
In reply to this post by Tathagata Das
Thanks. If I do not use a window and instead stream the data onto HDFS, could you suggest how to store only one week's worth of data? Should I create a cron job to delete HDFS files older than a week? Please let me know if you have any other suggestions.
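
One possible shape for such a cleanup job, as a plain-Python sketch: pick out the directories that have aged out of the one-week retention period. The directory layout (one directory per day, named "dt=YYYY-MM-DD") and the function name `expired_partitions` are assumptions for illustration, not something stated in this thread.

```python
from datetime import date, timedelta


def expired_partitions(partition_names, today, keep_days=7):
    """Return partition directories whose date is older than the retention period."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in partition_names:
        # Assumed layout: one directory per day, named "dt=YYYY-MM-DD".
        day = date.fromisoformat(name.split("=", 1)[1])
        if day < cutoff:
            expired.append(name)
    return expired


# Hypothetical directory listing and run date.
partitions = ["dt=2014-07-10", "dt=2014-07-11", "dt=2014-07-17", "dt=2014-07-18"]
to_delete = expired_partitions(partitions, today=date(2014, 7, 18))
print(to_delete)  # ['dt=2014-07-10']
```

A scheduled job (cron, for example) could then remove each returned directory, for instance with `hdfs dfs -rm -r` on the corresponding path.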