We're currently using Spark Streaming (Spark 2.3) for a number of applications. One pattern we've used successfully is to create an accumulator inside a DStream transform statement and accumulate values associated with the RDD as we process the data. A stage-completion listener then listens for stage-complete events, retrieves the AccumulableInfo for our custom classes, and flushes the statistics to our back-end.
We're now trying to move more of our applications to Structured Streaming, but the accumulator pattern doesn't obviously carry over. In many cases the built-in metrics give us the basic statistics (e.g. number of input and output events), but we need a pattern for more complex statistics (number of errors, number of internal records, etc.). If we define an accumulator on startup and add to it as we process data, we can see the statistics, but only as cumulative totals rather than per-trigger updates: if we read 10 records in the first trigger and 15 in the second, we see accumulated values of 10 and then 25.
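To make the problem concrete: since the accumulator only exposes cumulative totals, per-trigger counts can be recovered by differencing consecutive snapshots. A minimal sketch in plain Python (the `deltas` helper and the sample values are illustrative, not Spark API):

```python
def deltas(cumulative):
    """Convert cumulative accumulator snapshots into per-trigger counts."""
    prev = 0
    out = []
    for total in cumulative:
        out.append(total - prev)  # records seen in this trigger only
        prev = total
    return out

# Cumulative values observed after each trigger: 10 records, then 15 more.
print(deltas([10, 25]))  # -> [10, 15]
```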
There are several options that might allow us to move ahead:
1. We could have the AccumulableInfo carry both the previous and current counts
2. We could maintain current and previous counts separately ourselves
3. We could maintain a map from ID to AccumulatorV2 and call accumulator.reset() once we've read the data
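A rough sketch of option 3, with nothing Spark-specific: keep a map from metric ID to an accumulator-like object, read each value when the trigger's progress is reported, then reset it so the next read yields only the new increment. The `ResettableCounter` class and the `flush` helper below are hypothetical stand-ins, not the AccumulatorV2 API (and in real Spark, resetting a driver-side accumulator has its own subtleties, which is part of why this feels like a workaround):

```python
class ResettableCounter:
    """Hypothetical stand-in for an AccumulatorV2 that supports reset()."""
    def __init__(self):
        self._value = 0

    def add(self, n):
        self._value += n

    @property
    def value(self):
        return self._value

    def reset(self):
        self._value = 0

# Map from metric ID to accumulator, as in option 3.
metrics = {"errors": ResettableCounter(), "internal_records": ResettableCounter()}

def flush(metrics):
    """Snapshot each metric's value for this trigger, then reset it."""
    snapshot = {name: acc.value for name, acc in metrics.items()}
    for acc in metrics.values():
        acc.reset()
    return snapshot

metrics["errors"].add(3)
print(flush(metrics))  # values accumulated during this trigger
metrics["errors"].add(2)
print(flush(metrics))  # only the new increment, not the running total
```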
All of these options feel like hacky workarounds. Has anyone encountered this use case? Is there a good pattern to follow?