spark streaming questions


spark streaming questions

Liam Stewart
I'm looking at adding Spark / Shark to our analytics pipeline and would also like to use Spark Streaming for some incremental computations, but I have some questions about its suitability.

Roughly, we have events that are generated by app servers based on user interactions with our system; these events are protocol buffers and are sent through Kafka for distribution. For certain events, we'd like to extract some information and update a data store: basically, take a value, check it against what's already in the store, and update. It seems like Storm is more suitable for this task, as the events can be processed individually.
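
For concreteness, the per-event path I'd be weighing Storm against would look roughly like this in Spark Streaming (the ZooKeeper address, topic, decode step, and store update are placeholders, not real code):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object PerEventUpdate {
  // Stand-ins for our protobuf event and its decoding; the real pipeline
  // would deserialize the protocol buffer payload here.
  case class Event(userId: String, value: Long)
  def decode(msg: String): Event = {
    val Array(user, v) = msg.split(",")
    Event(user, v.toLong)
  }

  def main(args: Array[String]) {
    val ssc = new StreamingContext("local[2]", "per-event-update", Seconds(5))

    // With the default string decoders, Kafka messages arrive as (key, message) pairs.
    val events = KafkaUtils.createStream(ssc, "zk1:2181", "analytics", Map("events" -> 2))
      .map { case (_, msg) => decode(msg) }

    events.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // A real implementation would open one data-store connection here and
        // do a check-and-update per event; println is just a stand-in.
        partition.foreach(event => println(s"${event.userId} -> ${event.value}"))
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}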

We'd also like to update some statistics incrementally. These are currently spec'd as 30- or 90-day rolling windows but could end up being longer. We have O(million) users generating events over these time frames, but the individual streams are very sparse - generally only a few events per user per day at most. Writing roll-ups ourselves with Storm is a pain, while just running roll-up queries against the database is quite easy, but we're moving away from that since it generates a non-negligible amount of load that we'd rather avoid.

It seems like Spark Streaming's windows would fit the bill quite well, but I'm curious how well (if at all) such a large number of partitions and such long windows are supported.
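
What I had in mind is roughly the following (the durations and key type are just placeholders):

import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream

object RollingCounts {
  // events: (userId, 1L) pairs decoded from the Kafka stream.
  // The inverse-reduce form keeps the window incremental but needs
  // checkpointing enabled on the StreamingContext.
  def rollingCounts(events: DStream[(String, Long)]): DStream[(String, Long)] =
    events.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b,   // fold values entering the window
      (a: Long, b: Long) => a - b,   // subtract values sliding out
      Minutes(60 * 24 * 30),         // window length: ~30 days
      Minutes(60 * 24)               // slide interval: 1 day
    )
}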

One other question that I had is about the windows - whether or not an input falls outside a window is based on Spark's notion of when the input arrives, correct? If our Kafka cluster stopped sending data to Spark for a non-negligible amount of time for some reason, it seems that there would be the possibility of including extra data in a window.

Thanks,

liam

--
Liam Stewart :: [hidden email]

Re: spark streaming questions

Tathagata Das
Responses inline.


On Mon, Feb 3, 2014 at 11:03 AM, Liam Stewart <[hidden email]> wrote:
> I'm looking at adding Spark / Shark to our analytics pipeline and would also like to use Spark Streaming for some incremental computations, but I have some questions about its suitability.
>
> Roughly, we have events that are generated by app servers based on user interactions with our system; these events are protocol buffers and are sent through Kafka for distribution. For certain events, we'd like to extract some information and update a data store: basically, take a value, check it against what's already in the store, and update. It seems like Storm is more suitable for this task, as the events can be processed individually.
>
> We'd also like to update some statistics incrementally. These are currently spec'd as 30- or 90-day rolling windows but could end up being longer. We have O(million) users generating events over these time frames, but the individual streams are very sparse - generally only a few events per user per day at most. Writing roll-ups ourselves with Storm is a pain, while just running roll-up queries against the database is quite easy, but we're moving away from that since it generates a non-negligible amount of load that we'd rather avoid.
>
> It seems like Spark Streaming's windows would fit the bill quite well, but I'm curious how well (if at all) such a large number of partitions and such long windows are supported.
 
We have not tested Spark Streaming with very large windows, though windows may not be the most suitable option here. You should take a look at the updateStateByKey operation on DStreams, which allows you to maintain arbitrary per-key state and update it with new data. That may be more suitable for roll-ups: you would write the roll-up function that updates the values (i.e., the statistics) for each key (i.e., for each user). Take a look at the example - https://github.com/apache/incubator-spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/StatefulNetworkWordCount.scala?source=c
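
Roughly, the shape would be something like this (the statistics type and field names are placeholders for whatever roll-up you actually need):

import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream

object UserRollups {
  // Per-user running statistics; swap in whatever roll-up is actually needed.
  // Expiring data older than 30/90 days would mean keeping timestamps or
  // time buckets inside this state and dropping old ones in the update.
  case class UserStats(eventCount: Long, valueSum: Long)

  // newValues: this user's values in the current batch; state: stats so far.
  // Returning None would drop the key; requires ssc.checkpoint(...) to be set.
  def updateStats(newValues: Seq[Long], state: Option[UserStats]): Option[UserStats] = {
    val prev = state.getOrElse(UserStats(0L, 0L))
    Some(UserStats(prev.eventCount + newValues.size, prev.valueSum + newValues.sum))
  }

  // events: (userId, value) pairs from the decoded stream.
  def rollups(events: DStream[(String, Long)]): DStream[(String, UserStats)] =
    events.updateStateByKey(updateStats _)
}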
 

> One other question that I had is about the windows - whether or not an input falls outside a window is based on Spark's notion of when the input arrives, correct? If our Kafka cluster stopped sending data to Spark for a non-negligible amount of time for some reason, it seems that there would be the possibility of including extra data in a window.

Yes, Spark Streaming's receiver decides which batch a particular record falls into based on when the record arrived. Though I didn't get how you can end up with *extra* data if Kafka stops sending.
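
If the concern is records showing up long after the fact and landing in the wrong window, one option (just a sketch - the timestamp field here is made up) is to carry the event's own timestamp inside the record and filter on it before any windowing:

import org.apache.spark.streaming.dstream.DStream

object StaleFilter {
  // eventTimeMs would come out of the protobuf itself, not from Spark.
  case class TimedEvent(userId: String, value: Long, eventTimeMs: Long)

  // Drop records whose embedded timestamp is older than we care about,
  // e.g. maxAgeMs = 30 days in milliseconds, before they hit any window.
  def dropStale(events: DStream[TimedEvent], maxAgeMs: Long): DStream[TimedEvent] =
    events.filter(e => System.currentTimeMillis() - e.eventTimeMs <= maxAgeMs)
}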

TD
 