Query around Spark Checkpoints

Debabrata Ghosh
Hi,
    I have a question about Spark checkpoints: can I store the checkpoints in NoSQL or Kafka instead of a filesystem?

Regards,

Debu

Re: Query around Spark Checkpoints

Amit Joshi
Hi,

As far as I know, it depends on whether you are using Spark Streaming or Structured Streaming.
In Spark Streaming you can write your own code to checkpoint.
But in the case of Structured Streaming, it has to be a file location.
The main question, though, is why you want to checkpoint in NoSQL, given that it's eventually consistent.
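
For the Structured Streaming case, here is a minimal PySpark sketch of what "file location" means in practice. It assumes a `SparkSession` already exists; the helper name `start_rate_query` and any path you pass in are made up for illustration.

```python
def start_rate_query(spark, checkpoint_dir):
    """Start a toy streaming query whose checkpoint lives under checkpoint_dir."""
    # `spark` is assumed to be a pyspark.sql.SparkSession created elsewhere.
    stream = spark.readStream.format("rate").load()  # built-in test source
    return (
        stream.writeStream
        .format("console")
        # The checkpoint is configured as a path, not as a pluggable store.
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

The value would typically be a path on an HDFS-compatible filesystem, for example `start_rate_query(spark, "hdfs:///checkpoints/demo")`, where the path itself is only a placeholder.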


Regards
Amit

Re: Query around Spark Checkpoints

Jungtaek Lim-2
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CheckpointFileManager.scala

You would need to implement CheckpointFileManager yourself, and it is tightly integrated with HDFS (the parameters and return types of its methods are mostly HDFS types). That doesn't mean it's impossible to implement CheckpointFileManager against a non-filesystem store, but it would be non-trivial to implement all of the functionality and make it work seamlessly.

The required consistency is documented in the javadoc of CheckpointFileManager. Please read through it and evaluate whether your target storage can fulfill the requirements.
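
To make the consistency requirement more concrete, here is a minimal sketch of the write-to-temp-file-and-rename pattern, one common way a filesystem guarantees that no partially written checkpoint file is ever visible. It is plain Python against a local filesystem, not Spark's API; `write_atomically` is a hypothetical helper name, and the sketch is only meant to show the kind of guarantee a target store would have to match.

```python
import os
import tempfile


def write_atomically(path: str, data: bytes) -> None:
    """Write data so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Temp file in the same directory, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # The rename publishes the complete file in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Cancel: drop the partial temp file instead of exposing it.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A store that cannot offer an equivalent of this (for example, one where a reader may observe a half-written or stale entry) would likely be hard to fit behind the interface.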

Thanks,
Jungtaek Lim (HeartSaVioR)

Re: Query around Spark Checkpoints

Debabrata Ghosh
Thank you, Jungtaek and Amit! This is very helpful indeed!

Cheers,

Debu

Re: Query around Spark Checkpoints

bryan.jeffrey@gmail.com
Jungtaek,

How would you contrast stateful streaming with checkpoint vs. the idea of writing updates to a Delta Lake table, and then using the Delta Lake table as a streaming source for our state stream?

Thank you,

Bryan 

Re: Query around Spark Checkpoints

Jungtaek Lim-2
Sorry, I'm not familiar with Delta Lake. You may get a better answer from the Delta Lake mailing list.

One thing is clear: stateful processing is an essential feature of almost every streaming framework. If you're struggling with something around the state feature and trying to find a workaround, then something is probably going wrong. Please feel free to share it.

Thanks,
Jungtaek Lim (HeartSaVioR)
