[Spark Streaming] Support of non-time-based windows and lag function

0vbb

Hi,

 

I have a question regarding Spark Structured Streaming:

Will non-time-based window operations such as the lag function be supported at some point, or is this off the table due to technical difficulties?

 

I.e., will something like the following be possible in the future:

from pyspark.sql.functions import lag
from pyspark.sql.window import Window

# previous value of col_A per uid, ordered by the event timestamp
w = Window.partitionBy('uid').orderBy('timestamp')
df = df.withColumn('lag', lag(df['col_A']).over(w))

 

This would be useful in streaming applications, where we look for “patterns” based on the occurrence of multiple events (rows) in a particular order.
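
To make the use case a bit more concrete, here is a minimal batch-only sketch of the kind of pattern I have in mind (the toy schema uid/timestamp/col_A and the "A directly followed by B" pattern are just placeholder assumptions); today this runs on a static DataFrame, but not on a stream:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lag
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# toy batch data standing in for the stream (schema assumed: uid, timestamp, col_A)
df = spark.createDataFrame(
    [('u1', 1, 'A'), ('u1', 2, 'B'), ('u2', 1, 'B')],
    ['uid', 'timestamp', 'col_A'])

# attach the previous event of each uid to the current row
w = Window.partitionBy('uid').orderBy('timestamp')
events = df.withColumn('prev_A', lag(col('col_A')).over(w))

# a simple "event A directly followed by event B" pattern per uid
matches = events.filter((col('prev_A') == 'A') & (col('col_A') == 'B'))
matches.show()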

A different, more general way to achieve the above functionality would be to use joins, but for this a streaming-ready, monotonically increasing and (!) concurrent uid would be needed.
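
This is roughly what I mean by the join-based variant, again only as a batch-mode sketch under the same toy assumptions; the per-uid row_number below is exactly the ingredient for which I don't see a streaming-ready replacement:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# same toy batch data as in the sketch above
df = spark.createDataFrame(
    [('u1', 1, 'A'), ('u1', 2, 'B'), ('u2', 1, 'B')],
    ['uid', 'timestamp', 'col_A'])

# per-uid sequence number; this is the piece with no streaming-ready equivalent today
w = Window.partitionBy('uid').orderBy('timestamp')
numbered = df.withColumn('seq', row_number().over(w))

# pair each event with its successor via a self-join on consecutive sequence numbers
cur = numbered.alias('cur')
nxt = numbered.alias('nxt')
pairs = cur.join(
    nxt,
    (col('cur.uid') == col('nxt.uid')) & (col('cur.seq') + 1 == col('nxt.seq')))

# the same "A directly followed by B" pattern, expressed as a join instead of lag
matches = pairs.filter((col('cur.col_A') == 'A') & (col('nxt.col_A') == 'B'))
matches.show()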

 

Thanks a lot & best,

Michael