DStreams stop consuming from Kafka


Razvan-Daniel Mihai

I have a use case where I have to stream events from Kafka to a JDBC sink. Kafka producers write events in hourly bursts.

I started with a structured streaming approach, but it turns out that structured streaming has no JDBC sink. I found an implementation in Apache Bahir, but it's buggy and looks abandoned.

So I reimplemented the job using DStreams, and everything works fine except that the executors stop consuming once they've reached the latest offsets. All later events are discarded. The last INFO-level messages are lines like:

20/11/10 16:19:02 INFO KafkaRDD: Beginning offset 7908480 is the same as ending offset skipping dev_applogs 10

Here dev_applogs is the topic being consumed and 10 is the partition number.

I played with different values of "auto.offset.reset" and "enable.auto.commit", but they all lead to the same behaviour. The settings I actually need for my use case are:


I use Spark 2.4.0 and Kafka 2.2.1.

Is this the expected behaviour? Shouldn't the Spark executors poll the Kafka partitions continuously for new offsets? This is the behaviour with DataStreamReader, and it's what I expected to find with DStreams as well.
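For reference, a minimal sketch of the DStream setup described above, using the spark-streaming-kafka-0-10 direct stream API. The topic name comes from the log line above; the broker address, group id, and batch interval are assumptions:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("kafka-to-jdbc")
val ssc  = new StreamingContext(conf, Seconds(10)) // batch interval: assumption

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",            // placeholder broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "applogs-consumer",          // placeholder group id
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("dev_applogs"), kafkaParams)
)

stream.foreachRDD { rdd =>
  // Write the batch to the JDBC sink here, then commit the consumed
  // offsets back to Kafka so a restart resumes from the right position.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

ssc.start()
ssc.awaitTermination()
```

With this setup each micro-batch should poll the partitions again; batches that find no new data produce exactly the "Beginning offset ... is the same as ending offset" INFO line quoted above, which by itself is harmless.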


Re: DStreams stop consuming from Kafka

Maybe you can try the `foreachBatch` API in structured streaming, which lets you reuse existing batch data sources.
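A sketch of that approach: `foreachBatch` (available since Spark 2.4, which matches the version above) hands each micro-batch to you as a plain DataFrame, so the ordinary batch JDBC writer works and no streaming JDBC sink is needed. The broker address, JDBC URL, table name, and checkpoint path below are placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.appName("kafka-to-jdbc").getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .option("subscribe", "dev_applogs")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

val query = events.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Each micro-batch is a normal DataFrame, so the batch JDBC
    // writer can be reused directly.
    batch.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://db:5432/applogs") // placeholder
      .option("dbtable", "events")                        // placeholder
      .mode("append")
      .save()
  }
  .option("checkpointLocation", "/tmp/kafka-to-jdbc-ckpt") // placeholder
  .start()

query.awaitTermination()
```

Note that `foreachBatch` gives at-least-once delivery by default; exactly-once requires deduplicating on the batchId (or on event keys) in the target table.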

