[spark streaming] checkpoint location feature for batch processing


[spark streaming] checkpoint location feature for batch processing

rishishah.star
Hi All,

I recently started playing with Spark Streaming, and the checkpoint location feature looks very promising. I wonder if anyone has an opinion about using Spark Streaming with the checkpoint location option as a slow batch processing solution. What would be the pros and cons of utilizing streaming with the checkpoint location feature to achieve fault tolerance in a batch processing application?

--
Regards,

Rishi Shah

Re: [spark streaming] checkpoint location feature for batch processing

Burak Yavuz-2
Hi Rishi,

That is exactly why Trigger.Once was created for Structured Streaming. The way we look at streaming is that it doesn't have to be always real-time or 24/7 always-on. We see streaming as a workflow that you have to repeat indefinitely. See this blog post for more details!
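
A minimal sketch of what that looks like (spark is the active SparkSession; the schema and s3 paths below are just placeholders):

import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.StructType

// Placeholder schema; streaming file sources require one up front.
val inputSchema = new StructType().add("id", "long").add("payload", "string")

val input = spark.readStream
  .format("json")
  .schema(inputSchema)
  .load("s3://bucket/landing/")

input.writeStream
  .format("parquet")
  .option("path", "s3://bucket/output/")
  .option("checkpointLocation", "s3://bucket/checkpoints/job1/")
  .trigger(Trigger.Once())   // process everything new since the last run, then stop
  .start()
  .awaitTermination()

Each scheduled run resumes from the checkpoint, processes only the new data, and exits, so you get Structured Streaming's fault-tolerance bookkeeping on a batch schedule.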

Best,
Burak

Re: [spark streaming] checkpoint location feature for batch processing

rishishah.star
Thanks Burak! Appreciate it. This makes sense. 

How do you suggest we make sure the resulting data doesn't end up as tiny files if we are not on Databricks yet and cannot leverage Delta Lake features? Also, regarding the checkpointing feature, do you have an active blog/article I can take a look at to try out an example?
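
To make the question concrete, the kind of job I have in mind is roughly this (the schema, paths and partition count are just placeholders), where each run can still leave lots of small part files behind:

import org.apache.spark.sql.streaming.Trigger

val input = spark.readStream.format("json").schema(inputSchema).load("s3://bucket/landing/")

input
  .coalesce(8)   // crude attempt to cap the number of output files per run
  .writeStream
  .format("parquet")
  .option("path", "s3://bucket/output/")
  .option("checkpointLocation", "s3://bucket/checkpoints/job1/")
  .trigger(Trigger.Once())
  .start()
  .awaitTermination()

Is coalescing like this (or repartitioning inside foreachBatch) the recommended way to keep file counts under control, or is there something better?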

--
Regards,

Rishi Shah

Re: [spark streaming] checkpoint location feature for batch processing

Molotch
In reply to this post by Burak Yavuz-2
I've always had a question about Trigger.Once that I never got around to asking or testing for myself. Suppose you have a 24/7 stream into a Kafka topic.

Will Trigger.Once get the latest offset(s) when it starts and then quit once it hits those offset(s), or will the job run until no new messages have been added to the topic for a particular amount of time?

br,

Magnus

Re: [spark streaming] checkpoint location feature for batch processing

Jungtaek Lim-2
If I understand correctly, Trigger.Once executes only one micro-batch and then terminates; that's all. Your understanding of Structured Streaming applies there as well.

It's a hybrid approach: it brings the incremental processing of micro-batches but runs on a batch-like interval. That said, while it gives you the benefits of both sides, it's still basically Structured Streaming, so it inherits all the limitations of Structured Streaming compared to a batch query.

Spark 3.0.0 will bring a change to Trigger.Once (SPARK-30669): Trigger.Once will "ignore" the per-micro-batch read limit on the data source (like maxOffsetsPerTrigger) and process all available input. (Data sources need to migrate to the new API for this to take effect, but it works for built-in data sources like file and Kafka.)
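
For example, with the built-in Kafka source (the topic, broker and paths below are placeholders):

import org.apache.spark.sql.streaming.Trigger

val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("maxOffsetsPerTrigger", 100000)   // caps the single micro-batch on Spark 2.x;
                                            // ignored by Trigger.Once from 3.0.0 (SPARK-30669)
  .load()

stream.writeStream
  .format("parquet")
  .option("path", "/data/events/")
  .option("checkpointLocation", "/checkpoints/events/")
  .trigger(Trigger.Once())   // one micro-batch up to the latest available offsets, then terminate
  .start()
  .awaitTermination()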


Re: [spark streaming] checkpoint location feature for batch processing

Molotch
Thank you. So that would mean Spark gets the current latest offset(s) when the trigger fires and then processes all available messages in the topic up to and including that offset, as long as maxOffsetsPerTrigger is the default of None (or large enough to handle all available messages).

I think the word micro-batch confused me (more like mega-batch in some cases). It makes sense though; this makes Trigger.Once a fixed-interval trigger that's only fired once rather than repeatedly.


Re: [spark streaming] checkpoint location feature for batch processing

Jungtaek Lim-2
Replied inline:

On Sun, May 3, 2020 at 6:25 PM Magnus Nilsson <[hidden email]> wrote:
> Thank you. So that would mean Spark gets the current latest offset(s) when the trigger fires and then processes all available messages in the topic up to and including that offset, as long as maxOffsetsPerTrigger is the default of None (or large enough to handle all available messages).

Yes, it starts from the offsets of the latest batch. `maxOffsetsPerTrigger` will be ignored starting from Spark 3.0.0, which means that for Spark 2.x it still applies even when Trigger.Once is used, I guess.

> I think the word micro-batch confused me (more like mega-batch in some cases). It makes sense though; this makes Trigger.Once a fixed-interval trigger that's only fired once rather than repeatedly.

"Micro" is relative: though Spark by default processes all available input per batch, in most cases you'll want to make the batch interval as small as possible, since it defines the latency of the output. Trigger.Once is an unusual case for a streaming workload; it's more like continuous execution of a "batch". I say "continuous" in the sense of picking up the latest context, which is the characteristic of a streaming query, hence a hybrid.
 

