Lazy Spark Structured Streaming

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Lazy Spark Structured Streaming

Phillip Henry
Hi, folks.

I noticed that SSS won't process a waiting batch if there are no batches after that. To put it another way, Spark must always leave one batch on Kafka waiting to be consumed.

There is a JIRA for this at:


that says it's resolved in 2.4.0 but my code is using 2.4.2 yet I still see Spark reluctant to consume another batch from Kafka if it means there is nothing else waiting to be processed in the topic.

Do I have to do something special to exploit the behaviour that SPARK-24156 says it has addressed?

Regards,

Phillip



Reply | Threaded
Open this post in threaded view
|

Re: Lazy Spark Structured Streaming

Phillip Henry
Sorry, should have mentioned that Spark only seems reluctant to take the last windowed, groupBy batch from Kafka when using OutputMode.Append

I've asked on StackOverflow:
but am still struggling. Can anybody please help?

How do people test their SSS code if you have to put a message on Kafka to get Spark to consume a batch?

Kind regards,

Phillip


On Sun, Jul 12, 2020 at 4:55 PM Phillip Henry <[hidden email]> wrote:
Hi, folks.

I noticed that SSS won't process a waiting batch if there are no batches after that. To put it another way, Spark must always leave one batch on Kafka waiting to be consumed.

There is a JIRA for this at:


that says it's resolved in 2.4.0 but my code is using 2.4.2 yet I still see Spark reluctant to consume another batch from Kafka if it means there is nothing else waiting to be processed in the topic.

Do I have to do something special to exploit the behaviour that SPARK-24156 says it has addressed?

Regards,

Phillip



Reply | Threaded
Open this post in threaded view
|

Re: Lazy Spark Structured Streaming

Jungtaek Lim-2
I'm not sure what exactly your problem is, but given you've mentioned window and OutputMode.Append, you may want to remind that append mode doesn't produce the output of aggregation unless the watermark "passes by". It's expected behavior if you're seeing lazy outputs on OutputMode.Append compared to OutputMode.Update.

Unfortunately there's no mechanism on SSS to move forward only watermark without actual input, so if you want to test some behavior on OutputMode.Append you would need to add a dummy record to move watermark forward.

Hope this helps.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Mon, Jul 27, 2020 at 8:10 PM Phillip Henry <[hidden email]> wrote:
Sorry, should have mentioned that Spark only seems reluctant to take the last windowed, groupBy batch from Kafka when using OutputMode.Append

I've asked on StackOverflow:
but am still struggling. Can anybody please help?

How do people test their SSS code if you have to put a message on Kafka to get Spark to consume a batch?

Kind regards,

Phillip


On Sun, Jul 12, 2020 at 4:55 PM Phillip Henry <[hidden email]> wrote:
Hi, folks.

I noticed that SSS won't process a waiting batch if there are no batches after that. To put it another way, Spark must always leave one batch on Kafka waiting to be consumed.

There is a JIRA for this at:


that says it's resolved in 2.4.0 but my code is using 2.4.2 yet I still see Spark reluctant to consume another batch from Kafka if it means there is nothing else waiting to be processed in the topic.

Do I have to do something special to exploit the behaviour that SPARK-24156 says it has addressed?

Regards,

Phillip



Reply | Threaded
Open this post in threaded view
|

Re: Lazy Spark Structured Streaming

Phillip Henry
Thanks, Jungtaek. Very useful information.

Could I please trouble you with one further question - what you said makes perfect sense but to what exactly does SPARK-24156 refer if not fixing the "need to add a dummy record to move watermark forward"?

Kind regards,

Phillip




On Mon, Jul 27, 2020 at 11:41 PM Jungtaek Lim <[hidden email]> wrote:
I'm not sure what exactly your problem is, but given you've mentioned window and OutputMode.Append, you may want to remind that append mode doesn't produce the output of aggregation unless the watermark "passes by". It's expected behavior if you're seeing lazy outputs on OutputMode.Append compared to OutputMode.Update.

Unfortunately there's no mechanism on SSS to move forward only watermark without actual input, so if you want to test some behavior on OutputMode.Append you would need to add a dummy record to move watermark forward.

Hope this helps.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Mon, Jul 27, 2020 at 8:10 PM Phillip Henry <[hidden email]> wrote:
Sorry, should have mentioned that Spark only seems reluctant to take the last windowed, groupBy batch from Kafka when using OutputMode.Append

I've asked on StackOverflow:
but am still struggling. Can anybody please help?

How do people test their SSS code if you have to put a message on Kafka to get Spark to consume a batch?

Kind regards,

Phillip


On Sun, Jul 12, 2020 at 4:55 PM Phillip Henry <[hidden email]> wrote:
Hi, folks.

I noticed that SSS won't process a waiting batch if there are no batches after that. To put it another way, Spark must always leave one batch on Kafka waiting to be consumed.

There is a JIRA for this at:


that says it's resolved in 2.4.0 but my code is using 2.4.2 yet I still see Spark reluctant to consume another batch from Kafka if it means there is nothing else waiting to be processed in the topic.

Do I have to do something special to exploit the behaviour that SPARK-24156 says it has addressed?

Regards,

Phillip



Reply | Threaded
Open this post in threaded view
|

Re: Lazy Spark Structured Streaming

Jungtaek Lim-2
SPARK-24156 runs the no-data batch to apply the updated watermark, but the updated watermark may not be eligible to evict all state rows. (e.g. window, lateness of watermark)
You'll still need to provide dummy input record to advance watermark, so that all expected state rows can be evicted.

On Sun, Aug 2, 2020 at 5:44 PM Phillip Henry <[hidden email]> wrote:
Thanks, Jungtaek. Very useful information.

Could I please trouble you with one further question - what you said makes perfect sense but to what exactly does SPARK-24156 refer if not fixing the "need to add a dummy record to move watermark forward"?

Kind regards,

Phillip




On Mon, Jul 27, 2020 at 11:41 PM Jungtaek Lim <[hidden email]> wrote:
I'm not sure what exactly your problem is, but given you've mentioned window and OutputMode.Append, you may want to remind that append mode doesn't produce the output of aggregation unless the watermark "passes by". It's expected behavior if you're seeing lazy outputs on OutputMode.Append compared to OutputMode.Update.

Unfortunately there's no mechanism on SSS to move forward only watermark without actual input, so if you want to test some behavior on OutputMode.Append you would need to add a dummy record to move watermark forward.

Hope this helps.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Mon, Jul 27, 2020 at 8:10 PM Phillip Henry <[hidden email]> wrote:
Sorry, should have mentioned that Spark only seems reluctant to take the last windowed, groupBy batch from Kafka when using OutputMode.Append

I've asked on StackOverflow:
but am still struggling. Can anybody please help?

How do people test their SSS code if you have to put a message on Kafka to get Spark to consume a batch?

Kind regards,

Phillip


On Sun, Jul 12, 2020 at 4:55 PM Phillip Henry <[hidden email]> wrote:
Hi, folks.

I noticed that SSS won't process a waiting batch if there are no batches after that. To put it another way, Spark must always leave one batch on Kafka waiting to be consumed.

There is a JIRA for this at:


that says it's resolved in 2.4.0 but my code is using 2.4.2 yet I still see Spark reluctant to consume another batch from Kafka if it means there is nothing else waiting to be processed in the topic.

Do I have to do something special to exploit the behaviour that SPARK-24156 says it has addressed?

Regards,

Phillip