Understanding spark structured streaming checkpointing system

Understanding spark structured streaming checkpointing system

Ruijing Li
Hi all,

I have a question about how Structured Streaming does checkpointing. I'm noticing that Spark is not reading from the max/latest offset it has seen. For example, in HDFS I see it stored offset file 30, which contains partition: offset {1: 2000}.

But after stopping the job and restarting it, it instead reads from offset file 9, which contains {1: 1000}.

Can someone explain why Spark doesn't take the max offset?
 
Thanks.
--
Cheers,
Ruijing Li
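[Editor's note: the restart behavior being asked about can be sketched in miniature. Structured Streaming names each file in the checkpoint's offsets/ directory after its batch id and resumes from the highest-numbered one. The layout below is simplified (the function name and the one-line JSON file format are illustrative, not Spark's exact internals, which also include a version header and metadata line).]

```python
# Sketch of how Structured Streaming resolves its restart point from the
# checkpoint's offsets/ directory. Simplified assumption: each batch file
# is named by its batch id and contains a JSON map of partition -> offset,
# matching the {"1": 2000} example from the thread.
import json
import os
import tempfile

def latest_committed_offsets(checkpoint_dir):
    """Return (batch_id, offsets) for the highest-numbered offset file."""
    offsets_dir = os.path.join(checkpoint_dir, "offsets")
    batch_ids = [int(name) for name in os.listdir(offsets_dir) if name.isdigit()]
    latest = max(batch_ids)  # restart uses the latest batch file, not an older one
    with open(os.path.join(offsets_dir, str(latest))) as f:
        offsets = json.loads(f.readlines()[-1])
    return latest, offsets

# Simulate the checkpoint layout described in the thread.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "offsets"))
for batch_id, partition_offsets in [(9, {"1": 1000}), (30, {"1": 2000})]:
    with open(os.path.join(root, "offsets", str(batch_id)), "w") as f:
        f.write(json.dumps(partition_offsets))

batch, offsets = latest_committed_offsets(root)
print(batch, offsets)  # 30 {'1': 2000}
```

If Spark appears to resume from file 9 rather than file 30, something other than the normal selection of the latest file is happening, which is what the replies below dig into.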
Re: Understanding spark structured streaming checkpointing system

Jungtaek Lim-2
That sounds odd. Is it intermittent, or always reproducible if you start with the same checkpoint? What's the version of Spark?

On Fri, Apr 17, 2020 at 6:17 AM Ruijing Li <[hidden email]> wrote:
Re: Understanding spark structured streaming checkpointing system

Ruijing Li
It's not intermittent; it seems to happen every time. Spark fails when it starts up from the last checkpoint and complains that the offset is old. I checked, and it is indeed true that the offset had expired on the Kafka side. My version of Spark is 2.4.4, using the kafka 0.10 connector.
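[Editor's note: the failure mode described here, a checkpointed offset falling behind Kafka's earliest retained offset once retention expires, can be sketched as a small decision function. This mirrors in spirit, not verbatim, the Kafka source's failOnDataLoss behavior; the function and parameter names are illustrative, not Spark's actual internals.]

```python
# Sketch of the restart decision: after a restart, the checkpointed offset
# may have been deleted by Kafka retention. With fail_on_data_loss=True
# (the default in Spark's Kafka source), the query fails rather than
# silently skipping the lost range.

def resolve_start_offset(checkpointed, earliest_available, fail_on_data_loss=True):
    """Decide where to resume a partition after a restart."""
    if checkpointed >= earliest_available:
        return checkpointed  # normal case: resume exactly where we left off
    # The checkpointed offset expired from Kafka (retention passed it).
    if fail_on_data_loss:
        raise RuntimeError(
            f"Offsets out of range: checkpoint has {checkpointed}, "
            f"earliest available is {earliest_available}"
        )
    return earliest_available  # accept the data loss and continue from earliest

# Checkpoint says 2000, but Kafka retention already deleted everything below 2500.
try:
    resolve_start_offset(2000, 2500)
except RuntimeError as e:
    print("query fails:", e)

print(resolve_start_offset(2000, 2500, fail_on_data_loss=False))  # 2500
```

This is consistent with the symptom above: the complaint about an old offset comes from the expired-retention branch, not from Spark choosing an older offset file.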

On Sun, Apr 19, 2020 at 3:38 PM Jungtaek Lim <[hidden email]> wrote:
--
Cheers,
Ruijing Li