Joined RDD

ajay garg
Hi,
I have two RDDs, A and B, each created by reading a file from HDFS. A third RDD, C, is created by joining A and B. None of the three RDDs (A, B and C) is cached.
Now if I perform an action on C (say, collect), the action is served without reading any data from disk.
Since nothing is cached in Spark, how can the action on C be served without reading data from disk?

Thanks
--Ajay

Re: Joined RDD

Qin Wei
I think it is because A.join(B) is a shuffle map stage, whose result is stored temporarily (I'm not sure whether it is in memory or on disk).
I saw the words "map output" in the log of my Spark application. I think that is the intermediate result of my application and, according to the log, it is stored.


qinwei
 
Date: 2014-11-13 14:56
Subject: Joined RDD

Re: Joined RDD

Mayur Rustagi
In reply to this post by ajay garg
First of all, nothing is computed until you trigger an action such as collect.
When you trigger collect, Spark reads the data from disk at that point, joins the datasets together, and delivers the result to you.
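
A minimal sketch of the lazy evaluation described above, written as a plain-Python toy model rather than real Spark code. The LazyDataset class, its methods, and the sample records are all hypothetical; the only point is that defining the join reads nothing, and the inputs are read when collect is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: stores how to compute data, computes only on collect()."""

    def __init__(self, compute):
        # compute is a zero-argument function returning a list of (key, value) pairs
        self._compute = compute

    @classmethod
    def from_source(cls, name, records, read_log):
        def compute():
            read_log.append(name)  # note that the "file" was actually read
            return list(records)
        return cls(compute)

    def join(self, other):
        # Only builds a new recipe; neither input is read here.
        def compute():
            left = self._compute()
            right = other._compute()
            index = {}
            for key, value in right:
                index.setdefault(key, []).append(value)
            return [(k, (v, w)) for k, v in left for w in index.get(k, [])]
        return LazyDataset(compute)

    def collect(self):
        # The action: this is where the recipe finally runs.
        return self._compute()


read_log = []
a = LazyDataset.from_source("A", [(1, "a1"), (2, "a2")], read_log)
b = LazyDataset.from_source("B", [(1, "b1"), (3, "b3")], read_log)
c = a.join(b)

reads_before_action = list(read_log)  # nothing has been read yet
result = c.collect()
reads_after_action = list(read_log)   # both inputs were read by collect()
```

In this model, calling collect a second time would read both inputs again, because nothing is kept between actions.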

Mayur Rustagi
Ph: +1 (760) 203 3257

On Thu, Nov 13, 2014 at 12:26 PM, ajay garg <[hidden email]> wrote:

Re: Joined RDD

ajay garg
Yes, that is my understanding of how it should work.
But in my case, the first time I call collect, it reads the data from the files on disk.
Subsequent collect calls do not read the data files (verified from the logs).
In the Spark UI I see only shuffle read and no shuffle write.
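
One way to picture the behaviour reported here, consistent with Qin Wei's suggestion that the shuffle map output is kept after the first run, is the toy model below. It is plain Python, not Spark's implementation: ShuffleJoin, read_a, read_b, and the sample data are hypothetical, and the sketch simply assumes the map-side output of the join survives between actions, so a repeated collect only needs the reduce-side read.

```python
class ShuffleJoin:
    """Toy model of a join whose map-side (shuffle) output is kept between actions."""

    def __init__(self, read_left, read_right, num_partitions=2):
        self._read_left = read_left
        self._read_right = read_right
        self._num_partitions = num_partitions
        self._map_output = None  # stands in for shuffle files kept after the first run

    def _map_stage(self):
        # "Shuffle write": read both inputs and bucket the records by key.
        buckets = [([], []) for _ in range(self._num_partitions)]
        for key, value in self._read_left():
            buckets[hash(key) % self._num_partitions][0].append((key, value))
        for key, value in self._read_right():
            buckets[hash(key) % self._num_partitions][1].append((key, value))
        return buckets

    def collect(self):
        # Run the map stage only if its output is not already available.
        if self._map_output is None:
            self._map_output = self._map_stage()
        # "Shuffle read": join the records inside each bucket.
        joined = []
        for left, right in self._map_output:
            for key, value in left:
                for other_key, other_value in right:
                    if key == other_key:
                        joined.append((key, (value, other_value)))
        return sorted(joined)


reads = {"A": 0, "B": 0}

def read_a():
    reads["A"] += 1  # count how often input A is read
    return [(1, "a1"), (2, "a2"), (3, "a3")]

def read_b():
    reads["B"] += 1  # count how often input B is read
    return [(1, "b1"), (3, "b3")]

joined = ShuffleJoin(read_a, read_b)
first = joined.collect()   # reads A and B, writes the map output
second = joined.collect()  # reuses the map output, reads no input
```

In this model the second call never touches the inputs, which matches seeing shuffle read but no shuffle write on repeated actions. Whether a real job behaves this way can be checked against the stage list in the Spark UI.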