Long term arbitrary stateful processing - best practices

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Long term arbitrary stateful processing - best practices

monohusche
Hi there,

We are currently evaluating Spark to provide streaming analytics for
unbounded data sets. Basically creating aggregations over event time windows
while supporting both early and late/out of order messages. For streaming
aggregations (=SQL operations with well-defined semantics, e.g. count, avg),
I think, it is all pretty clear.

My question is more regarding arbitrary stateful processing, especially
mapGroupsWithState.

The scenario is as follows: We have continuously incoming purchase order
messages (via Kafka), every message referencing a user. Firstly, the status
of the referenced user needs to be updated (=active). If no purchase within
a year, he/she is deemed inactive (i.e. the timeout case).

On top of that, every month, we need to calculate the aggregate on top of
these user states, i.e. the percentage of active users. This deck
(https://www.slideshare.net/databricks/a-deep-dive-into-stateful-stream-processing-in-structured-streaming-with-tathagata-das)
describes a similar scenario from page 45.

The issue in our case is how to manage such arbitrary state at scale. On one
hand, we have a large number of users, at the same time, the window (to keep
the intermediate state) is very large (=a whole year).

Are there alternative patterns for implementing such a requirement?

By the way, I am quite new to Spark Structured Streaming, so primarily
looking for pointers to avoid barking up the wrong tree.

thanks in advance, Nick


 



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]