Streaming: getting total count over all windows

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
SK
Reply | Threaded
Open this post in threaded view
|

Streaming: getting total count over all windows

SK
Hi,

I am using the following code to generate the (score, count) for each window:

val score_count_by_window  = topic.map(r =>  r._2)   // r._2 is the integer score
                                                     .countByValue()
                           
score_count_by_window.print()  

E.g. output for a window is as follows, which means that within the Dstream for that window, there are 2 rdds with score 0; 3 with score 1, and 1 with score -1.
(0, 2)
(1, 3)
(-1, 1)

I would like to get the aggregate count for each score over all windows until program terminates. I tried countByValueAndWindow() but the result is same as countByValue() (i.e. it is producing only per window counts).  reduceByWindow also does not produce the result I am expecting. What is the correct way to sum up the counts over multiple windows?

thanks






Reply | Threaded
Open this post in threaded view
|

Re: Streaming: getting total count over all windows

jay vyas-2
I would think this should be done at the application level. 
After all, the core functionality of SparkStreaming is to capture RDDs in some real time interval and process them -
not to aggregate their results.

But maybe there is a better way.......

On Thu, Nov 13, 2014 at 8:28 PM, SK <[hidden email]> wrote:
Hi,

I am using the following code to generate the (score, count) for each
window:

val score_count_by_window  = topic.map(r =>  r._2)   // r._2 is the integer
score
                                                     .countByValue()

score_count_by_window.print()

E.g. output for a window is as follows, which means that within the Dstream
for that window, there are 2 rdds with score 0; 3 with score 1, and 1 with
score -1.
(0, 2)
(1, 3)
(-1, 1)

I would like to get the aggregate count for each score over all windows
until program terminates. I tried countByValueAndWindow() but the result is
same as countByValue() (i.e. it is producing only per window counts).
reduceByWindow also does not produce the result I am expecting. What is the
correct way to sum up the counts over multiple windows?

thanks










--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-getting-total-count-over-all-windows-tp18888.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]




--
jay vyas
Reply | Threaded
Open this post in threaded view
|

Re: Streaming: getting total count over all windows

Mayur Rustagi
So if you want to do from beginning to end of time the interface is updateStatebykey, if only over a particular set of windows you can construct broader windows from smaller windows/batches. 

Mayur Rustagi
Ph: +1 (760) 203 3257

On Fri, Nov 14, 2014 at 9:17 AM, jay vyas <[hidden email]> wrote:
I would think this should be done at the application level. 
After all, the core functionality of SparkStreaming is to capture RDDs in some real time interval and process them -
not to aggregate their results.

But maybe there is a better way.......

On Thu, Nov 13, 2014 at 8:28 PM, SK <[hidden email]> wrote:
Hi,

I am using the following code to generate the (score, count) for each
window:

val score_count_by_window  = topic.map(r =>  r._2)   // r._2 is the integer
score
                                                     .countByValue()

score_count_by_window.print()

E.g. output for a window is as follows, which means that within the Dstream
for that window, there are 2 rdds with score 0; 3 with score 1, and 1 with
score -1.
(0, 2)
(1, 3)
(-1, 1)

I would like to get the aggregate count for each score over all windows
until program terminates. I tried countByValueAndWindow() but the result is
same as countByValue() (i.e. it is producing only per window counts).
reduceByWindow also does not produce the result I am expecting. What is the
correct way to sum up the counts over multiple windows?

thanks










--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-getting-total-count-over-all-windows-tp18888.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]




--
jay vyas