Dataset API Question

Bernard Jesop
Hello everyone,

I have a question about checkpointing Datasets.

In 2.1.0 there is a Dataset.checkpoint(), but unlike RDD there is no Dataset.isCheckpointed().

I wonder whether Dataset.checkpoint is just syntactic sugar for Dataset.rdd.checkpoint.
When I do:

Dataset.checkpoint; Dataset.count
Dataset.rdd.isCheckpointed // result: false

However, when I explicitly do:
Dataset.rdd.checkpoint; Dataset.rdd.count
Dataset.rdd.isCheckpointed // result: true

Could someone explain this behavior to me, or provide some references?

Best regards,
Bernard

Re: Dataset API Question

rxin
It is a bit more than syntactic sugar, but not much more: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L533

BTW, this basically writes all the data out and then creates a new Dataset that loads it back in.
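
A minimal sketch of what this implies, assuming a placeholder checkpoint directory and made-up names (CheckpointSketch, checkpointTo) that are not part of Spark. Because checkpoint() hands back a new Dataset built on the written data, the returned value is the one to keep using; the Dataset it was called on is left as it was:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object CheckpointSketch {
  // Hypothetical helper: sets the checkpoint directory, then returns the
  // checkpointed copy of `ds`. checkpoint() is eager by default, so the
  // data is written out before this method returns.
  def checkpointTo[T](ds: Dataset[T], dir: String): Dataset[T] = {
    ds.sparkSession.sparkContext.setCheckpointDir(dir)
    ds.checkpoint()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("checkpoint-sketch")
      .getOrCreate()

    val ds = spark.range(0, 1000)
    // "/tmp/spark-checkpoints" is a placeholder path (assumption).
    val checkpointed = checkpointTo(ds, "/tmp/spark-checkpoints")

    // Continue the job from `checkpointed`, not from `ds`.
    println(checkpointed.count())
    spark.stop()
  }
}
```

Calling ds.checkpoint and discarding the result, as in the original snippet, still writes the data but leaves nothing in hand that refers to it.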



Re: Dataset API Question

Bernard Jesop
As far as I understand, Dataset.rdd is not the same as InternalRDD.
It is just another RDD representation of the same Dataset, created on demand (lazy val) when Dataset.rdd is called.
That explains the observed behavior: checkpointing one of them does not mark the other as checkpointed.

But how can I tell whether a Dataset has been checkpointed?
Should I keep track of that myself?
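
One way to keep track of it manually could look like the sketch below. TrackedDataset is a hypothetical wrapper, not something Spark provides; it only assumes that Dataset.checkpoint() returns the checkpointed Dataset and that a checkpoint directory has already been set on the SparkContext:

```scala
import org.apache.spark.sql.Dataset

// Hypothetical wrapper (not part of Spark): carries a flag next to the Dataset,
// since Dataset itself exposes no isCheckpointed().
final case class TrackedDataset[T](ds: Dataset[T], checkpointed: Boolean = false) {

  // Checkpoints eagerly and wraps the *new* Dataset that checkpoint() returns.
  // Assumes SparkContext.setCheckpointDir was called beforehand.
  def checkpoint(): TrackedDataset[T] =
    TrackedDataset(ds.checkpoint(), checkpointed = true)
}
```

With this, TrackedDataset(someDs).checkpoint().checkpointed is true, while a freshly wrapped Dataset reports false. The flag only reflects calls made through the wrapper.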


Re: Dataset API Question

Bernard Jesop
Actually, I realized that keeping track of it would not be enough, because I also need to locate the checkpoint files in order to delete them :/
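
A rough sketch of one way to get at the files, under the assumption that it is acceptable to drop every checkpoint written by the current SparkContext rather than a single Dataset's. SparkContext.getCheckpointDir returns the directory in use, and the Hadoop FileSystem API can delete it; CheckpointCleanup and deleteAllCheckpoints are made-up names:

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext

object CheckpointCleanup {
  // Hypothetical helper: recursively deletes the checkpoint directory of `sc`.
  // Returns true if a directory was set and the delete succeeded.
  // Note: this removes ALL checkpoints of this context, so any Dataset or RDD
  // still relying on them can no longer be read.
  def deleteAllCheckpoints(sc: SparkContext): Boolean =
    sc.getCheckpointDir match {
      case Some(dir) =>
        val path = new Path(dir)
        val fs = path.getFileSystem(sc.hadoopConfiguration)
        fs.delete(path, true) // true = recursive
      case None =>
        false // no checkpoint directory was ever set
    }
}
```

This is only safe once nothing depends on the checkpointed data any more, for example at the very end of a job. The spark.cleaner.referenceTracking.cleanCheckpoints setting may also be worth a look for automatic cleanup.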
