Spark Stateful Streaming - add counter column

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view

Spark Stateful Streaming - add counter column


I have a a Spark Streaming process that consumes records off a Kafka topic, processes them and sends them to a producer to publish on another topic. I would like to add a sequence number column that can be used to identify records that have the same key and be incremented for each duplicate reoccurence of that key. For example if the output sent to the producer is

Key, col1, col2, seqnum 
A, 67, dog, 1 
B, 56, cat, 1 
C, 89, fish, 1

then if A reoccurs within a reasonable time interval Spark would produce the following:

A, 67, dog, 2 
B, 56, cat, 2

etc. How would I do that ? I suspect that this is a pattern that occurs frequently, but I haven't found any examples.

Sent from my iPhone