How to reduceByKeyAndWindow in Structured Streaming?

How to reduceByKeyAndWindow in Structured Streaming?

oripwk
In Structured Streaming, there's the notion of event-time windowing:

  words.groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"word"
  ).count()


However, this is not quite the same as DStream's windowing operations: in
Structured Streaming, windowing groups the data by fixed time windows, and
every event is associated with the window it falls into:

  12:00-12:05 [banana, apple]
  12:05-12:10 [orange]
  …


DStreams, by contrast, simply output all the data that arrived within a
sliding window of time (the last 10 minutes, for example).
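For comparison, the DStream operation in question looks roughly like this (a
minimal sketch; `pairs` is an assumed DStream[(String, Int)]):

  import org.apache.spark.streaming.Seconds

  // Every 5 minutes, emit per-key sums over records that *arrived* in the
  // last 10 minutes (processing time, not event time).
  val windowedCounts = pairs.reduceByKeyAndWindow(
    (a: Int, b: Int) => a + b,   // reduce function
    Seconds(600),                // window length: 10 minutes
    Seconds(300)                 // slide interval: 5 minutes
  )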

The question was also asked here
<https://stackoverflow.com/questions/49821646/is-there-someway-to-do-the-eqivalent-of-reducebykeyandwindow-in-spark-structured>,
in case that makes it clearer.

How can the latter be achieved in Structured Streaming?



Re: How to reduceByKeyAndWindow in Structured Streaming?

maasg
Hi,

In Structured Streaming lingo, "ReduceByKeyAndWindow" would be a window aggregation with a composite key.
Something like:
  stream
    .groupBy($"key", window($"timestamp", "5 minutes"))
    .agg(sum($"value") as "total")

The aggregate could be any supported SQL function.
Is this what you are looking for? If not, share your specific use case and we can see how it could be implemented in Structured Streaming.
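If you also want the window/slide pair from reduceByKeyAndWindow, window() accepts a slide duration as a third argument. A sketch (the column names are just placeholders):

  stream
    .groupBy($"key", window($"timestamp", "10 minutes", "5 minutes"))
    .agg(sum($"value") as "total")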

kr, Gerard.


Re: How to reduceByKeyAndWindow in Structured Streaming?

Tathagata Das
The fundamental conceptual difference between windowing in DStreams vs. Structured Streaming is that DStreams use the record's arrival time in Spark (a.k.a. processing time), while Structured Streaming uses event time. If you want to exactly replicate DStream's processing-time windows in Structured Streaming, you can just add the current timestamp as an additional column in the DataFrame and group by using that:

streamingDF
    .withColumn("processing_time", current_timestamp())
    .groupBy($"key", window($"processing_time", "5 minutes"))
    .agg(sum($"value") as "total")
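
A runnable end-to-end sketch of the same idea (the socket source and the key/value column names are illustrative assumptions, not from this thread):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._

  val spark = SparkSession.builder.appName("ProcessingTimeWindows").getOrCreate()
  import spark.implicits._

  // Illustrative source: each socket line becomes one record with count 1.
  val streamingDF = spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
    .select($"value" as "key", lit(1) as "value")

  // Stamp each record with its arrival time and aggregate per key and window.
  val totals = streamingDF
    .withColumn("processing_time", current_timestamp())
    .groupBy($"key", window($"processing_time", "5 minutes"))
    .agg(sum($"value") as "total")

  totals.writeStream
    .outputMode("update")   // emit only windows that changed in each trigger
    .format("console")
    .start()
    .awaitTermination()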



Re: How to reduceByKeyAndWindow in Structured Streaming?

oripwk
Thanks guys, it really helps.


