Spark 3.0.1 Structured streaming - checkpoints fail


aldu29
Hello,

I have an issue with my PySpark job related to checkpointing.

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 16997.0 failed 4 times, most recent failure: Lost task 3.3 in stage 16997.0 (TID 206609, 10.XXX, executor 4): java.lang.IllegalStateException: Error reading delta file file:/opt/spark/workdir/query6/checkpointlocation/state/0/3/1.delta of HDFSStateStoreProvider[id = (op=0,part=3),dir = file:/opt/spark/workdir/query6/checkpointlocation/state/0/3]: file:/opt/spark/workdir/query6/checkpointlocation/state/0/3/1.delta does not exist
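
For context, here is a minimal sketch of the kind of setup that produces a node-local checkpoint like the one above. The source, sink and aggregation are illustrative assumptions, not the actual job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query6").getOrCreate()

events = spark.readStream.format("rate").load()   # placeholder source, not the real one
counts = events.groupBy("value").count()          # stateful operation -> uses the state store

query = (counts.writeStream
         .outputMode("update")
         .format("console")                       # placeholder sink, not the real one
         # A file:/ checkpoint lives on whichever node writes it, so the state
         # delta files (state/0/3/1.delta, ...) are not visible to executors
         # running on other hosts, which triggers the IllegalStateException.
         .option("checkpointLocation",
                 "file:/opt/spark/workdir/query6/checkpointlocation")
         .start())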

This job is based on Spark 3.0.1 and Structured Streaming.
The Spark cluster (1 driver and 6 executors) runs without HDFS, and we would rather not manage an HDFS cluster if possible.
Is a distributed filesystem necessary? What are the possible solutions/workarounds?

Thanks in advance
David
Re: Spark 3.0.1 Structured streaming - checkpoints fail

Lalwani, Jayesh

Yes. It is necessary to have a distributed file system because all the workers need to read from and write to the checkpoint. The distributed file system also has to be immediately consistent: when one node writes to it, the other nodes should be able to read it immediately.

The solutions/workarounds depend on where you are hosting your Spark application.
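
As a minimal illustration (the URIs below are assumptions, not a recommendation for any particular hosting setup), the change usually comes down to pointing checkpointLocation at storage that the driver and every executor can reach:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query6").getOrCreate()
counts = spark.readStream.format("rate").load().groupBy("value").count()

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         # Any immediately consistent storage reachable from all nodes, e.g.:
         #   "hdfs://namenode:8020/checkpoints/query6"
         #   "file:///mnt/shared/checkpoints/query6"   (a mount shared by all nodes)
         #   "s3a://my-bucket/checkpoints/query6"
         .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/query6")
         .start())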

 

Re: Spark 3.0.1 Structured streaming - checkpoints fail

aldu29
Thanks.
My Spark applications run on nodes based on Docker images, but in standalone mode (1 driver, n workers).
Can we use S3 directly with a consistency add-on such as S3Guard (via s3a) or AWS consistent view?
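
Roughly, here is a sketch of what an s3a-backed checkpoint could look like in PySpark; the bucket name, credentials and the hadoop-aws version (which must match the Hadoop build of the Spark distribution) are assumptions:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("query6")
         # hadoop-aws provides the S3A filesystem connector; version assumed here.
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
         .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
         .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
         # Optional S3Guard metadata store, only if an extra consistency layer
         # is wanted on top of S3 (per the Hadoop 3.x S3Guard documentation):
         # .config("spark.hadoop.fs.s3a.metadatastore.impl",
         #         "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
         .getOrCreate())

events = spark.readStream.format("rate").load()
counts = events.groupBy("value").count()

query = (counts.writeStream
         .outputMode("update")
         .format("console")
         # Checkpoint and state now live in S3 and are readable by every executor.
         .option("checkpointLocation", "s3a://my-bucket/checkpoints/query6")
         .start())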


Re: Spark 3.0.1 Structured streaming - checkpoints fail

aldu29
Does it work with standard AWS S3 and its new consistency model?


Re: Spark 3.0.1 Structured streaming - checkpoints fail

Jungtaek Lim-2
Probably we may want to add it in the SS guide doc. We didn't need to document it before because it simply didn't work with S3's eventually consistent model; now it works, but it is still very inefficient.



Re: Spark 3.0.1 Structured streaming - checkpoints fail

aldu29
Thanks Jungtaek.
OK, I got it. I'll test it and check whether the loss of efficiency is acceptable.
