Is there a difference between df.cache() vs df.rdd.cache()

Supun Nakandala
Hi all,

I have been experimenting with the cache/persist/unpersist methods in both the DataFrame and RDD APIs. However, I am seeing different behavior in the DataFrame API compared with the RDD API; for example, DataFrames are not getting cached when count() is called.

Is there a difference between how these operations behave in the DataFrame and RDD APIs?

Thank You.
-Supun

Re: Is there a difference between df.cache() vs df.rdd.cache()

Weichen Xu
You should use `df.cache()`.
`df.rdd.cache()` won't cache the DataFrame, because `df.rdd` generates a new RDD from the original `df`, and it is that new RDD that gets cached.
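
One way to check this yourself is to compare storage levels after each call. The following is a minimal sketch, assuming Spark 2.1 or later (where `Dataset.storageLevel` is available) and a local session. The object name `CacheComparison`, the helper `run`, and the `spark.range` test data are illustrative and do not come from this thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical demo object; names and data are illustrative only.
object CacheComparison {

  // Returns, in order:
  //   1. the DataFrame's storage level after caching only df.rdd,
  //   2. the storage level of the RDD obtained from df.rdd,
  //   3. the DataFrame's storage level after df.cache().
  def run(spark: SparkSession): (StorageLevel, StorageLevel, StorageLevel) = {
    val df = spark.range(0L, 1000L).toDF("id")

    // Cache the RDD[Row] derived from the DataFrame, then run an action on it.
    val rowRdd = df.rdd
    rowRdd.cache()
    rowRdd.count()
    val dfLevelAfterRddCache = df.storageLevel // level reported for the DataFrame itself
    val rddLevel = rowRdd.getStorageLevel      // level reported for the derived RDD

    // Cache the DataFrame itself; the data is materialized by the next action.
    df.cache()
    df.count()
    val dfLevelAfterDfCache = df.storageLevel

    // Release both caches before returning.
    df.unpersist()
    rowRdd.unpersist()
    (dfLevelAfterRddCache, rddLevel, dfLevelAfterDfCache)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cache-comparison")
      .getOrCreate()
    try {
      val (afterRddCache, rddLevel, afterDfCache) = run(spark)
      println(s"df after df.rdd.cache(): $afterRddCache")
      println(s"df.rdd:                  $rddLevel")
      println(s"df after df.cache():     $afterDfCache")
    } finally {
      spark.stop()
    }
  }
}
```

If the explanation above holds, the first value should be `StorageLevel.NONE` even though the derived RDD reports a storage level, and only the third value should show the DataFrame as cached.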

Re: Is there a difference between df.cache() vs df.rdd.cache()

Supun Nakandala
Hi Weichen,

Thank you for the reply.

My understanding was that the DataFrame API uses the old RDD implementation under the covers, though it presents a different API, and that calling df.rdd simply gives access to the underlying RDD. Is this assumption wrong? I would appreciate it if you could shed more light on this issue or point me to documentation where I can learn more.

Thank you in advance.

Re: Is there a difference between df.cache() vs df.rdd.cache()

Vadim Semenov
When you call `Dataset.rdd`, you actually create a new job.




Re: Is there a difference between df.cache() vs df.rdd.cache()

Stephen Boesch
@Vadim Would it be true to say that `.rdd` *may* create a new job, depending on whether the DataFrame/Dataset had already been materialized via an action or checkpoint? If the only prior operations on the DataFrame had been transformations, then the DataFrame would still not have been computed. In that case, would it also be true that a subsequent action/checkpoint on the DataFrame (not the RDD) would then generate a separate job?

Re: Is there a difference between df.cache() vs df.rdd.cache()

Weichen Xu
Hi Supun,

The DataFrame API is NOT using the old RDD implementation under the covers; DataFrame has its own implementation. (DataFrames use a binary row format, and columnar storage when cached.) So the DataFrame has no direct relationship with the `RDD[Row]` you want to get.

When you call `df.rdd` and then cache it, Spark needs to turn the DataFrame into an RDD: it extracts each row from the DataFrame, deserializes it, and composes the new RDD.
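
The conversion described here can be observed by comparing the public `df.rdd` with the internal-row RDD exposed through `df.queryExecution.toRdd`. This is a rough sketch, assuming a local Spark 2.x or 3.x session. `queryExecution` is a developer-facing API that may change between versions, and the object name `RowFormats` and the helper `rowClasses` are made up for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical demo object; names and data are illustrative only.
object RowFormats {

  // Returns the runtime class names of the first element of
  // (a) the RDD[Row] from df.rdd and (b) the internal-row RDD.
  def rowClasses(spark: SparkSession): (String, String) = {
    val df = spark.range(0L, 10L).toDF("id")

    // Public API: rows are deserialized into external Row objects.
    val external: RDD[Row] = df.rdd

    // Developer API: rows stay in Spark SQL's internal representation.
    val internal: RDD[InternalRow] = df.queryExecution.toRdd

    (external.first().getClass.getName, internal.first().getClass.getName)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-formats")
      .getOrCreate()
    try {
      val (externalClass, internalClass) = rowClasses(spark)
      println(s"df.rdd element class:                 $externalClass")
      println(s"queryExecution.toRdd element class:   $internalClass")
    } finally {
      spark.stop()
    }
  }
}
```

The two class names should differ: the internal side typically shows a binary row class, while `df.rdd` yields ordinary `Row` objects. That deserialization step is why caching `df.rdd` stores something different from what `df.cache()` stores.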

Thanks!

Re: Is there a difference between df.cache() vs df.rdd.cache()

Supun Nakandala
Hi Weichen,

Thank you very much for the explanation.
