How to achieve this in Spark


How to achieve this in Spark

ssimanta
I have an RDD that contains ids (Long).

subsetids

res22: org.apache.spark.rdd.RDD[Long]


I have another RDD of objects (MyObject), where one of the fields is an id (Long).

allobjects 

res23: org.apache.spark.rdd.RDD[MyObject] = MappedRDD[272]

Now I want to filter allobjects to get the subset whose ids are present in my first RDD (i.e., subsetids).

Say, something like:

val subsetObjs = allobjects.filter(x => subsetids.contains(x.getId))

However, RDD has no "contains" method, so I'm looking for the most efficient way of achieving this in Spark.

Thanks. 




Re: How to achieve this in Spark

Mark Hamstra
Your problem is more basic than that: you can't reference one RDD (subsetids) from within an operation on another RDD (the closure you pass to allobjects.filter). RDD operations can only be invoked from the driver, not from inside another RDD's tasks.


On Wed, Feb 19, 2014 at 2:23 PM, Soumya Simanta <[hidden email]> wrote:


Re: How to achieve this in Spark

Nathan Kronenfeld
In reply to this post by ssimanta
It depends on how big subsetids is.

If subsetids is small, then you can just collect it to the driver and broadcast it as a set, then use it exactly as you wrote.
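That first option is just an in-memory set-membership test. Here is a minimal sketch using plain Scala collections standing in for the RDDs (MyObject is a hypothetical stand-in for the poster's class; in real Spark code the set would come from subsetids.collect().toSet and be wrapped in sc.broadcast so each executor gets a single copy):

```scala
// Hypothetical stand-in for the original poster's class; only getId matters here.
case class MyObject(id: Long, payload: String) {
  def getId: Long = id
}

// The filter the original poster wanted, once the ids exist as a local Set.
// In Spark, idSet would be a broadcast variable and the closure would read
// its .value on each executor.
def filterByIdSet(allobjects: Seq[MyObject], idSet: Set[Long]): Seq[MyObject] =
  allobjects.filter(x => idSet.contains(x.getId))

val objs = Seq(MyObject(1L, "a"), MyObject(2L, "b"), MyObject(3L, "c"))
println(filterByIdSet(objs, Set(1L, 3L)).map(_.payload).mkString(","))  // prints a,c
```

Since Set.contains is effectively constant-time, the filter stays O(n) in the number of objects, which is why this is the preferred route whenever the id set fits in memory.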

If subsetids is too big for that, you need to join it against the second RDD, keyed by id:

val keyedObjs = allobjects.keyBy(_.getId)  // (id, MyObject) pairs, so we can join on id
val subsetObjs = subsetids
  .map(i => (i, i))   // turn the ids into key/value form for use with join
  .join(keyedObjs)    // inner join keeps only the objects whose id appears in subsetids
  .map(_._2._2)       // drop the keys to get back plain MyObjects
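The shape of that join-based pipeline can be illustrated with plain Scala collections standing in for the RDDs (MyObject is again a hypothetical stand-in; the local `join` helper mimics the semantics of Spark's pair-RDD join):

```scala
case class MyObject(id: Long, payload: String) {
  def getId: Long = id
}

// A local inner join on (key, value) pairs, mimicking RDD.join's semantics:
// each left pair is matched against every right pair with the same key.
def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] = {
  val rightByKey = right.groupBy(_._1)
  for {
    (k, v) <- left
    (_, w) <- rightByKey.getOrElse(k, Seq.empty)
  } yield (k, (v, w))
}

// The pipeline from the reply, step by step, on Seqs instead of RDDs.
def filterByJoin(subsetids: Seq[Long], allobjects: Seq[MyObject]): Seq[MyObject] = {
  val keyedIds  = subsetids.map(i => (i, i))         // ids in key/value form
  val keyedObjs = allobjects.map(o => (o.getId, o))  // like allobjects.keyBy(_.getId)
  join(keyedIds, keyedObjs).map(_._2._2)             // keep joined objects, drop keys
}

val objs = Seq(MyObject(1L, "a"), MyObject(2L, "b"), MyObject(3L, "c"))
println(filterByJoin(Seq(1L, 3L), objs).map(_.payload).mkString(","))  // prints a,c
```

One caveat the local version makes visible: an inner join emits one output row per matching pair, so duplicate ids in subsetids would duplicate objects in the result; in Spark you would typically call .distinct() on the ids first.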

I hope this helps.
                      -Nathan



--
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  [hidden email]

Re: How to achieve this in Spark

Mayur Rustagi
In reply to this post by ssimanta
You can convert it into a map and ship it along with the filter closure. That's a bad implementation if the map is huge; in that case, convert the map into a broadcast variable and send that along with the filter.



