General Spark question (streaming)

od
od
Hello,

I am new to Spark and have a few questions that are fairly general in nature:

I am trying to set up a real-time data analysis pipeline where clients send events to a load-balanced collection point, and the "collectors" then forward the data to a Spark cluster via ZeroMQ pub/sub (just an experiment).

What do people generally do once they have the data in Spark to enable real-time analytics? Do you store it in some persistent storage and analyze it within some window (say, the last five minutes) after enough has been aggregated, or...?

If I want to count the number of occurrences of an event within a given time frame in a streaming context, does Spark support this, and how? General guidelines are fine, and any experience, knowledge, and advice is greatly appreciated!

Thanks
Ognen
Re: General Spark question (streaming)

Khanderao kand
1. "What do people generally do once they have the data in Spark to enable real-time analytics? Do you store it in some persistent storage and analyze it within some window (say, the last five minutes) after enough has been aggregated, or...?"
>>>It depends on your application. If you have a dashboarding/alerting application, you would push the aggregated results to the UI or a message queue. However, if you want those results to be available for later queries, they need to be persisted in storage such as HBase.
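To illustrate the "push aggregates downstream each batch" pattern described above: in Spark Streaming this is typically done inside foreachRDD on an aggregated DStream. The plain-Python sketch below has no Spark dependency; the in-memory deque is a hypothetical stand-in for a dashboard message queue or an HBase table, and push_batch stands in for the per-interval write a real job would perform.

```python
from collections import deque

# Hypothetical sink standing in for a dashboard message queue or an
# HBase table. In Spark Streaming, the equivalent write would happen
# inside foreachRDD, once per batch interval.
sink = deque()

def push_batch(aggregates):
    """Push one micro-batch's aggregated results (event -> count) to the
    sink, so a dashboard can consume them now or a later query can read
    them back."""
    sink.append(dict(aggregates))

# Two simulated batch intervals' worth of aggregated counts.
push_batch({"login": 3, "click": 10})
push_batch({"login": 1})
```

The point is only the shape: aggregation happens in the streaming job, and each interval's result is handed off to whatever sink your application needs.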

2. "If I want to count the number of occurrences of an event within a given time frame in a streaming context, does Spark support this, and how?"
  >>>Spark Streaming supports windowed operations as well as counting, e.g. countByWindow and countByValueAndWindow.
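Conceptually, a windowed count in Spark Streaming is a count over the last N micro-batches, re-evaluated every slide interval. A minimal plain-Python illustration of that logic (no Spark dependency; the batch contents and the batches_per_window parameter are made up for the example):

```python
from collections import Counter

# Each element is one micro-batch of event names received in that interval.
batches = [
    ["login", "click"],
    ["click", "click"],
    ["login"],
]

def windowed_counts(batches, batches_per_window):
    """Count each event over the last `batches_per_window` micro-batches,
    mimicking what Spark Streaming's countByValueAndWindow computes at
    each slide interval."""
    window = batches[-batches_per_window:]
    return Counter(event for batch in window for event in batch)

counts = windowed_counts(batches, batches_per_window=2)
# Window covers the last two batches: {"click": 2, "login": 1}
```

In an actual Spark Streaming job you would express the same thing directly on the DStream, e.g. a window of five minutes sliding every ten seconds, and Spark maintains the window state for you across batches.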


On Thu, Jan 9, 2014 at 11:07 AM, Ognen Duzlevski <[hidden email]> wrote: