Why is the RDD not cached?

shahabm
Hi,

I have a standalone Spark setup where each executor is configured with 6.3 GB of memory; since I am using two workers, there are 12.6 GB of memory and 4 cores in total.
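For reference, the configuration corresponds roughly to the following (a sketch; the app name and master URL are assumptions):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("CacheTest")                       // hypothetical app name
  .setMaster("spark://master:7077")              // standalone master; host is an assumption
  .set("spark.executor.memory", "6300m")         // 6.3 GB per executor, two workers => 12.6 GB total
  .set("spark.storage.memoryFraction", "0.6")    // Spark 1.x default: only this fraction of the heap is used for caching
val sc = new SparkContext(conf)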

I am trying to cache an RDD with an approximate size of 3.2 GB, but apparently it is not cached: I can see neither "BlockManagerMasterActor: Added rdd_XX in memory" in the logs nor any improvement in task performance.

Why is it not cached when there is enough storage memory?
I tried with smaller RDDs of 1 or 2 GB and it works; at least I could see "BlockManagerMasterActor: Added rdd_0_1 in memory" and an improvement in the results.

Any idea what I am missing in my settings, or...?

thanks,
/Shahab

Re: Why is the RDD not cached?

Jagat Singh
What setting are you using for persist() or cache()?

http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence
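For example, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), which silently drops partitions that do not fit in memory and recomputes them on demand; an explicit level such as MEMORY_AND_DISK spills them to disk instead (a sketch, where rdd stands for the RDD in question):

import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_ONLY)         // equivalent to rdd.cache()
// rdd.persist(StorageLevel.MEMORY_AND_DISK)  // spill partitions that don't fit to disk
// rdd.persist(StorageLevel.MEMORY_ONLY_SER)  // store serialized; smaller footprint, slower to read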


Re: Why is the RDD not cached?

sowen
In reply to this post by shahabm

Did you just call cache()? By itself it does nothing, but once an action requires the RDD to be computed, it should become cached.
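In other words, roughly (with made-up data):

val rdd = sc.parallelize(1 to 1000000).map(_ * 2)
rdd.cache()   // lazy: only marks the RDD to be persisted
rdd.count()   // first action computes the RDD and populates the cache
rdd.count()   // later actions read the cached blocks instead of recomputing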


Re: Why is the RDD not cached?

shahabm
I used cache() followed by a count() on the RDD to ensure that caching is performed:

val rdd = srdd.flatMap(mapProfile_To_Sessions).cache()

val count = rdd.count()

// so at this point the RDD should be cached, right?
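If the count completes but no "Added rdd_..." lines show up, one way to inspect the cache programmatically is via the storage info on the SparkContext (a sketch, assuming the context is named sc; the Storage tab of the web UI shows the same information):

sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
}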



Re: Why is the RDD not cached?

Mayur Rustagi
What is the partition count of the RDD? It's possible that you don't have enough memory to store the whole RDD on a single machine. Can you try forcibly repartitioning the RDD and then caching it, as in the sketch below?
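For example (the partition count of 64 is purely illustrative):

val repartitioned = rdd.repartition(64)   // smaller partitions are easier to fit in memory
repartitioned.cache()
repartitioned.count()                     // action forces materialization of the cache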
Regards
Mayur
