Caching


Caching

Amit Sharma-2
Hi all, I am using caching in my code. I have DataFrames like:

val DF1 = spark.read.csv(...)
val DF2 = DF1.groupBy(...).agg(...).select(.....)

val DF3 = spark.read.csv(...).join(DF1).join(DF2)
DF3.write.save(...)

If I do not cache DF1 or DF2, the job takes longer. But I am performing only one action, so why do I need to cache?

Thanks
Amit



RE: Caching

Theodoros Gkountouvas

Hi Amit,

 

One action might use the same DataFrame more than once. You can look at your logical plan by executing DF3.explain() (the arguments differ depending on the Spark version you are using) and see how many times DF1 or DF2 needs to be computed. Given the information you have provided, I suspect that DF1 is used more than once (once in DF2 and once in DF3). If you cache it, Spark will compute it the first time and then load it from the cache instead of recomputing it the second time.
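Theo's point about inspecting the plan can be illustrated with a toy model (plain Python, not the Spark API; the plan shapes and file names are made up for illustration): if each DataFrame's lineage is a tree of operations, counting how often DF1's subtree occurs inside DF3's tree shows how many times it would be recomputed without caching.

```python
# Toy lineage model (NOT Spark): each "plan" is a nested tuple of
# (operation, *children). DF1's plan appears as a subtree wherever DF1
# is used, mirroring what a real DF3.explain() would show.

df1 = ("scan_csv", "file1.csv")
df2 = ("aggregate", df1)                              # DF1.groupBy().agg()
df3 = ("join", ("scan_csv", "file2.csv"), df1, df2)   # joins DF1 and DF2

def count_subtree(plan, target):
    """Count how many times `target` occurs as a subtree of `plan`."""
    hits = 1 if plan == target else 0
    for child in plan[1:]:
        if isinstance(child, tuple):
            hits += count_subtree(child, target)
    return hits

print(count_subtree(df3, df1))  # 2: DF1 appears twice in DF3's plan
```

Two occurrences in the plan means two evaluations of DF1 for the single action on DF3, unless DF1 is cached.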

 

I hope this helped,

Theo.

 

Re: Caching

Amit Sharma-2
Thanks for the information. I am using Spark 2.3.3. A few more questions:

1. Yes, I am using DF1 two times, but in the end there is only one action, on DF3. In that case, is DF1 evaluated just once, or does it depend on how many times the DataFrame is used in transformations?

I believe that even if we use a DataFrame multiple times in transformations, the decision to cache should be based on actions; in my case the only action is the save call on DF3. Please correct me if I am wrong.

Thanks
Amit

Re: Caching

srowen
No, it's not true that one action means every DataFrame is evaluated once. This is a good counterexample.
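Sean's counterexample can be sketched with a toy lazy-evaluation model in plain Python (a stand-in for Spark's behavior, not the real API): one action on DF3 still evaluates DF1 twice unless it is cached.

```python
# Toy lazy evaluation (NOT Spark): each node recomputes its inputs on
# every use, like an uncached DataFrame. cache() memoizes the result.

class LazyDF:
    def __init__(self, fn, *deps):
        self.fn, self.deps = fn, deps
        self._cache = None
        self._use_cache = False
        self.computations = 0      # how many times this node actually ran

    def cache(self):
        self._use_cache = True
        return self

    def compute(self):
        if self._use_cache and self._cache is not None:
            return self._cache
        self.computations += 1
        result = self.fn(*(d.compute() for d in self.deps))
        if self._use_cache:
            self._cache = result
        return result

# Without cache: one action, but DF1 runs twice (once for the join,
# once inside DF2's aggregation).
df1 = LazyDF(lambda: list(range(5)))          # "read csv"
df2 = LazyDF(lambda a: [sum(a)], df1)         # "groupBy().agg()"
df3 = LazyDF(lambda a, b: a + b, df1, df2)    # "join"
df3.compute()                                 # the single "action"
print(df1.computations)                       # 2

# With cache: the second use hits the memoized result.
df1b = LazyDF(lambda: list(range(5))).cache()
df2b = LazyDF(lambda a: [sum(a)], df1b)
df3b = LazyDF(lambda a, b: a + b, df1b, df2b)
df3b.compute()
print(df1b.computations)                      # 1
```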


Re: Caching

Lalwani, Jayesh
In reply to this post by Amit Sharma-2

Since DF2 depends on DF1, and DF3 depends on both DF1 and DF2, without caching Spark will read the CSV twice: once to load it for DF1, and once to load it for DF2. When you add a cache on DF1 or DF2, the CSV is read only once.

 

You might want to look at using a windowed query on DF1 to avoid joining DF1 with DF2. This should give you performance similar to or better than caching, because Spark already materializes (effectively caches) the data during the shuffle.
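Jayesh's window suggestion, sketched as a toy in plain Python rather than the Spark API (in real Spark this would be something like `sum("value").over(Window.partitionBy("key"))`): attaching a per-group aggregate to each row directly produces the same rows as aggregating and joining back.

```python
# Toy illustration of "window aggregate vs. groupBy + join" (NOT Spark).
# Both attach each group's total to every row of that group.

rows = [("a", 1), ("a", 2), ("b", 3)]

# Approach 1: aggregate, then join back (the DF2-join-DF1 pattern).
totals = {}
for key, value in rows:
    totals[key] = totals.get(key, 0) + value
joined = [(key, value, totals[key]) for key, value in rows]

# Approach 2: window-style aggregate over the partition, no join.
windowed = [(key, value, sum(v for k, v in rows if k == key))
            for key, value in rows]

print(joined == windowed)  # True: same result, the join is unnecessary
```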

 


Re: Caching

Amit Sharma-2
In reply to this post by srowen
Sean, you mean that if a DataFrame is used more than once in transformations, then we should cache it. But frankly that is not always the case either: in many places, the result is the same whether the DataFrame is cached or not. How do we decide whether or not to use cache?


Thanks
Amit


Re: Caching

Amit Sharma-2
In reply to this post by Lalwani, Jayesh
Jayesh, but during logical planning Spark would know that the same DataFrame is used twice, so wouldn't it optimize the query?


Thanks
Amit


Re: Caching

"Yuri Oleynikov (‫יורי אולייניקוב‬‎)"
In reply to this post by Amit Sharma-2
Are you reading the same CSV twice?

Sent from my iPhone




Re: Caching

Lalwani, Jayesh
In reply to this post by Amit Sharma-2
> Jayesh, but during logical plan spark would be knowing to use the same DF twice so it will optimize the query.

 

No. That would require Spark to cache DF1. Spark won't cache DataFrames unless you ask it to, even if it knows that the same DataFrame is being used twice, because caching introduces memory overhead and Spark is not going to incur that cost on its own. It will combine the processing of multiple DataFrames within a stage; however, in your case, the aggregation creates a new stage.

You can check the execution plan if you like.
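Checking the plan is the reliable way to see whether a cache is actually in effect. As a toy sketch (plain Python, not Spark; the node names are made up), rendering a lineage tree shows the repeated subtree, and marking a node as explicitly cached collapses each use into a single in-memory scan, analogous to the `InMemoryTableScan` entries that appear in real `explain()` output once a DataFrame is cached.

```python
# Toy "explain" (NOT Spark): renders a plan tree; a node listed in
# `cached` is shown as an in-memory scan instead of its full subtree.

def explain(plan, cached=frozenset(), depth=0):
    name = plan[0]
    if plan in cached:
        return "  " * depth + f"InMemoryScan[{name}]\n"
    lines = "  " * depth + name + "\n"
    for child in plan[1:]:
        if isinstance(child, tuple):
            lines += explain(child, cached, depth + 1)
    return lines

df1 = ("scan_csv", "file1.csv")
df2 = ("aggregate", df1)
df3 = ("join", df1, df2)

print(explain(df3))                  # scan_csv appears twice: recomputed
print(explain(df3, cached={df1}))    # both uses become InMemoryScan
```

Note that the cache is only applied because it was requested explicitly via the `cached` argument, mirroring Jayesh's point that reuse in the plan alone does not make Spark cache anything.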

 
