Question about 'maxOffsetsPerTrigger'


Question about 'maxOffsetsPerTrigger'

Eric Beabes
While running my Spark (stateful) Structured Streaming job I am setting the 'maxOffsetsPerTrigger' value to 10 million. I've noticed that messages are processed faster if I use a large value for this property.

What I am also noticing is that until the batch is completely processed, no messages get written to the output Kafka topic. The state timeout is set to 10 minutes, so I expected to see at least some of the messages after about 10 minutes, but messages are not written until processing of the next batch starts.

Is there any property I can use to 'flush' the messages that are ready to be written? Please let me know. Thanks.
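For context, a setup like the one described might look roughly like the sketch below. This is a minimal PySpark outline, not the poster's actual job: the broker addresses, topic names, and checkpoint path are placeholders, and the stateful transformation itself is elided.

```python
# Minimal sketch of a stateful Structured Streaming job reading from and
# writing to Kafka. All names (servers, topics, paths) are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stateful-demo").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
    .option("subscribe", "input-topic")                 # placeholder
    .option("maxOffsetsPerTrigger", 10_000_000)         # cap on offsets per micro-batch
    .load()
)

# ...stateful transformation with a 10-minute state timeout would go here...

query = (
    events.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
    .option("topic", "output-topic")                    # placeholder
    .option("checkpointLocation", "/tmp/checkpoints")   # required for the Kafka sink
    .start()
)
```

Note that `maxOffsetsPerTrigger` only caps how many offsets one micro-batch reads; it does not control when results within a batch become visible downstream.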


Re: Question about 'maxOffsetsPerTrigger'

Jungtaek Lim-2
Since Spark processes streams as micro-batches, you have to tune the batch size to balance throughput against latency. In particular, Spark uses a global watermark that does not advance during a micro-batch, so you'd want to keep batches relatively small to let the watermark move forward faster.
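To make this concrete, here is a toy model (plain Python, not Spark code) of why a large 'maxOffsetsPerTrigger' delays output: the global watermark only advances at micro-batch boundaries, so timed-out state can only be flushed when a batch completes. With smaller batches there are more boundaries, so expired state is emitted after far fewer offsets have been consumed. The event times, timeout, and batch sizes are made up for illustration.

```python
import math

def simulate(events, max_offsets, timeout):
    """Toy model of micro-batching with a *global* watermark.

    events: list of (key, event_time) pairs, in arrival order.
    The watermark only moves once a whole micro-batch has been processed,
    and keyed state whose timeout has expired is emitted only at that
    batch boundary. Returns {key: batch_index_at_which_it_was_emitted}.
    """
    state = {}      # key -> event_time of its last event
    watermark = 0
    emitted = {}
    n_batches = math.ceil(len(events) / max_offsets)
    for b in range(n_batches):
        batch = events[b * max_offsets:(b + 1) * max_offsets]
        for key, t in batch:
            state[key] = t
        # The watermark advances only after the whole micro-batch is done.
        watermark = max(watermark, max(t for _, t in batch))
        # Timed-out state is flushed at the batch boundary, never mid-batch.
        expired = [k for k, t in state.items() if t + timeout <= watermark]
        for key in expired:
            emitted[key] = b
            del state[key]
    return emitted

events = [(f"k{i}", i) for i in range(100)]
# One huge batch: k0's timeout expires, but it is only emitted after all
# 100 offsets have been consumed (the single batch boundary).
big = simulate(events, max_offsets=100, timeout=10)
# Ten small batches: k0 is emitted at the second batch boundary, i.e.
# after only 20 offsets.
small = simulate(events, max_offsets=10, timeout=10)
```

The model ignores wall-clock time, but the ordering effect is the same as in the real job: output for a key cannot appear before the micro-batch that advances the watermark past its timeout has finished.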

On Wed, Jul 1, 2020 at 2:54 AM Eric Beabes <[hidden email]> wrote: