I have a use case where I have to stream events from Kafka to a JDBC sink. Kafka producers write events in bursts of hourly batches.
I started with a Structured Streaming approach, but it turns out that Structured Streaming has no built-in JDBC sink. I found an implementation in Apache Bahir, but it's buggy and looks abandoned.
So I reimplemented the job using DStreams, and everything works fine except that the executors stop consuming once they've reached the latest offsets. All future events are discarded. The last INFO-level messages are lines like:
20/11/10 16:19:02 INFO KafkaRDD: Beginning offset 7908480 is the same as ending offset skipping dev_applogs 10
Here, dev_applogs is the topic being consumed and 10 is the partition number.
I played with different values of "auto.offset.reset" and "enable.auto.commit", but they all lead to the same behaviour. The settings I actually need for my use case are:
I use Spark 2.4.0 and Kafka 2.2.1.
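For reference, this is roughly how the consumer parameters are assembled for a direct stream in the spark-streaming-kafka-0-10 integration. The broker address, group id, and the concrete values of the two offset-related settings below are placeholders, not my actual configuration:

```scala
// Sketch of the Kafka parameter map that gets passed to
// KafkaUtils.createDirectStream. Broker address, group id, and the
// values of the two settings being varied are placeholders.
val kafkaParams: Map[String, Object] = Map(
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> "org.apache.kafka.common.serialization.StringDeserializer",
  "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "group.id"           -> "dev_applogs-jdbc-sink",
  "auto.offset.reset"  -> "latest",                  // placeholder value
  "enable.auto.commit" -> (false: java.lang.Boolean) // placeholder value
)
```

Whatever combination of the last two settings I use, the stream stops at the latest offsets as described above.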
Is this the expected behaviour? Shouldn't the Spark executors poll the Kafka partitions continuously for new offsets? That is the behaviour with DataStreamReader, and it's what I expected to find with DStreams as well.