Long-term arbitrary stateful processing - best practices
We are currently evaluating Spark Structured Streaming to provide streaming
analytics over unbounded data sets: essentially, building aggregations over
event-time windows while supporting both early and late/out-of-order messages.
For the built-in streaming aggregations (i.e. SQL operations with well-defined
semantics such as count or avg), everything seems pretty clear to me.
My question concerns arbitrary stateful processing, i.e. the
mapGroupsWithState / flatMapGroupsWithState APIs.
The scenario is as follows: purchase-order messages arrive continuously
(via Kafka), each referencing a user. On every purchase, the status of the
referenced user must be updated to active; if a user makes no purchase within
a year, he/she is deemed inactive (i.e. the timeout case).
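To pin down the semantics I have in mind, here is a minimal, framework-free
Python sketch of the per-user state machine (a plain-dict simulation, not
actual Spark code; the function names and the one-year constant are my own
assumptions):

```python
from dataclasses import dataclass

ONE_YEAR_MS = 365 * 24 * 60 * 60 * 1000  # assumed inactivity timeout

@dataclass
class UserState:
    active: bool
    last_purchase_ms: int

def on_purchase(states: dict, user_id: str, event_time_ms: int) -> None:
    """A purchase (re)activates the user and resets the timeout clock."""
    prev = states.get(user_id)
    # Keep the latest event time even if messages arrive out of order.
    last = max(event_time_ms, prev.last_purchase_ms) if prev else event_time_ms
    states[user_id] = UserState(active=True, last_purchase_ms=last)

def expire(states: dict, watermark_ms: int) -> list:
    """The timeout case: users with no purchase for a year become inactive."""
    expired = []
    for user_id, st in states.items():
        if st.active and watermark_ms - st.last_purchase_ms > ONE_YEAR_MS:
            states[user_id] = UserState(active=False,
                                        last_purchase_ms=st.last_purchase_ms)
            expired.append(user_id)
    return expired
```

In Spark terms this would presumably map to keyed state per user plus an
event-time timeout; the simulation is only meant to make precise what
"active" and "timeout" should mean.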
The issue in our case is how to manage such arbitrary state at scale: on the
one hand we have a large number of users; at the same time, the window over
which the intermediate state must be kept is very large (a whole year).
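For a sense of scale, a back-of-envelope estimate (the user count and per-key
overhead below are purely assumed figures, not measurements from our system):

```python
# Hypothetical figures: 50M users, ~200 bytes of state + store overhead per key.
num_users = 50_000_000
bytes_per_key = 200
total_gb = num_users * bytes_per_key / 1e9
print(f"~{total_gb:.0f} GB of keyed state")  # ~10 GB
```

Even at that size the state seems manageable if it is partitioned by user and
checkpointed incrementally, but that is exactly the kind of assumption I would
like to have validated.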
Are there alternative patterns for implementing such a requirement?
By the way, I am quite new to Spark Structured Streaming, so I am primarily
looking for pointers to avoid barking up the wrong tree.