setting partitioners with hadoop rdds


setting partitioners with hadoop rdds

Imran Rashid
Hi,


I'm trying to figure out how to get partitioners to work correctly with Hadoop RDDs, so that I can get narrow dependencies and avoid shuffling.  I feel like I must be missing something obvious.

I can create an RDD with a partitioner of my choosing, shuffle it, and then save it out to HDFS.  But I can't figure out how to get it to still have that partitioner after I read it back in from HDFS.  HadoopRDD always has its partitioner set to None, and there isn't any way for me to change it.

The reason I care is that if I can set the partitioner, there will be a narrow dependency, so I can avoid a shuffle.  I have a big data set that I'm saving on HDFS.  Some time later, in a totally independent SparkContext, I read a little more data in, shuffle it with the same partitioner, and then want to join it to the previous data sitting on HDFS.

I guess this can't be done in general, since you don't have any guarantees on how the file was saved in HDFS.  But it still seems like there ought to be a way to do this, even if I need to enforce safety at the application level.
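To make the scenario concrete, here is a minimal sketch of the workflow in question (the path, key type, and partition count are illustrative, not from any real job):

```scala
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD functions in older Spark

object PartitionerLostDemo {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "partitioner-demo")

    // Shuffle a pair RDD into 16 hash partitions and save it to HDFS.
    val big = sc.parallelize(1 to 1000).map(i => (i % 16, i))
      .partitionBy(new HashPartitioner(16))
    big.saveAsObjectFile("hdfs:///tmp/big") // illustrative path

    // Reading it back, the partitioner is gone: Spark only sees a
    // HadoopRDD (plus a map) with partitioner == None.
    val reloaded = sc.objectFile[(Int, Int)]("hdfs:///tmp/big")
    assert(reloaded.partitioner.isEmpty)

    // So this join shuffles *both* sides, even though `reloaded` is
    // already laid out by the same partitioner as `small`.
    val small = sc.parallelize(1 to 10).map(i => (i % 16, i))
      .partitionBy(new HashPartitioner(16))
    reloaded.join(small).count()
  }
}
```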

Sorry if I'm missing something obvious ...

thanks,
Imran

Re: setting partitioners with hadoop rdds

Matei Zaharia
Administrator
Hey Imran,

You probably have to create a subclass of HadoopRDD to do this, or some RDD that wraps around the HadoopRDD. It would be a cool feature but HDFS itself has no information about partitioning, so your application needs to track it.
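A sketch of the wrapper approach described above (the class and constructor names are hypothetical, and nothing here verifies the application's claim):

```scala
import scala.reflect.ClassTag
import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

// Wraps an RDD read back from HDFS and simply *asserts* a partitioner
// for it, without moving any data. The application is responsible for
// the assertion being true; if it is not, joins that rely on the
// resulting narrow dependency will be silently wrong.
class AssumedPartitionedRDD[K, V](
    prev: RDD[(K, V)],
    assumed: Partitioner)(implicit ct: ClassTag[(K, V)])
  extends RDD[(K, V)](prev) {

  require(prev.partitions.length == assumed.numPartitions,
    "partition count must match the assumed partitioner")

  override val partitioner: Option[Partitioner] = Some(assumed)

  // One-to-one with the parent's partitions: a narrow dependency.
  override protected def getPartitions: Array[Partition] = prev.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
    prev.iterator(split, context)
}
```

With a wrapper like this, `new AssumedPartitionedRDD(reloaded, new HashPartitioner(16)).join(small)` would plan a narrow dependency on the reloaded side, since `HashPartitioner` compares equal by partition count.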

Matei

On Jan 27, 2014, at 11:57 PM, Imran Rashid <[hidden email]> wrote:



RE: setting partitioners with hadoop rdds

Shao, Saisai
In reply to this post by Imran Rashid

Hi Imran,

 

Maybe you can try implementing your own InputFormat and InputSplit to control your partitioning and read strategy; Spark supports custom InputFormats in HadoopRDD.
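One concrete variant of this idea, sketched below under the assumption that the data was written as one file per partition: make the format non-splittable, so each part file is read back as exactly one split (the class name is made up):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.mapred.SequenceFileInputFormat

// A SequenceFileInputFormat that never splits files, so each part file
// written by one task comes back as a single split / Spark partition.
// Note this keeps partition *boundaries* stable but does not by itself
// tell Spark anything about the partitioner.
class NonSplittableSequenceFileInputFormat[K, V]
  extends SequenceFileInputFormat[K, V] {

  override protected def isSplitable(fs: FileSystem, file: Path): Boolean =
    false
}
```

This would be plugged in via `sc.hadoopFile(path, classOf[NonSplittableSequenceFileInputFormat[K, V]], keyClass, valueClass)`; the application still has to re-attach the partitioner itself.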

 

Thanks

Jerry

From: Imran Rashid [mailto:[hidden email]]
Sent: Tuesday, January 28, 2014 3:58 PM
To: [hidden email]
Subject: setting partitioners with hadoop rdds

 



Re: setting partitioners with hadoop rdds

Imran Rashid
In reply to this post by Matei Zaharia
Thanks for the info.  Do you think this would be useful in Spark itself?  E.g., add a method to RDD like assumePartitioner(partitioner: Partitioner, verify: Boolean), where verify would run a mapPartitionsWithIndex pass to check that every record is actually in the partition it belongs in.

I'm surprised this hasn't come up before -- maybe there is a better way to do something similar?
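A self-contained sketch of what such an assumePartitioner helper could look like (all names are hypothetical; when verify is false, the wrapper blindly trusts the caller):

```scala
import scala.reflect.ClassTag
import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

object PartitionerUtils {

  // Minimal wrapper that re-attaches a partitioner without moving data.
  private class AssumedRDD[K, V](prev: RDD[(K, V)], p: Partitioner)(
      implicit ct: ClassTag[(K, V)]) extends RDD[(K, V)](prev) {
    override val partitioner: Option[Partitioner] = Some(p)
    override protected def getPartitions: Array[Partition] = prev.partitions
    override def compute(split: Partition, ctx: TaskContext): Iterator[(K, V)] =
      prev.iterator(split, ctx)
  }

  def assumePartitioner[K, V](rdd: RDD[(K, V)], p: Partitioner,
      verify: Boolean)(implicit ct: ClassTag[(K, V)]): RDD[(K, V)] = {
    require(rdd.partitions.length == p.numPartitions)
    if (verify) {
      // One extra pass over the data: fail fast if any record sits in a
      // partition other than the one the partitioner would assign it to.
      rdd.mapPartitionsWithIndex { (idx, iter) =>
        iter.foreach { case (k, _) =>
          require(p.getPartition(k) == idx,
            s"key $k found in partition $idx, expected ${p.getPartition(k)}")
        }
        Iterator.single(true)
      }.count()
    }
    new AssumedRDD(rdd, p)
  }
}
```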


On Tue, Jan 28, 2014 at 12:25 AM, Matei Zaharia <[hidden email]> wrote: