coalescing RDD into equally sized partitions

Walrus theCat
Hi,

sc.parallelize(Array.tabulate(100)(i => i)).filter(_ % 20 == 0).coalesce(5, true).glom.collect yields

Array[Array[Int]] = Array(Array(0, 20, 40, 60, 80), Array(), Array(), Array(), Array())

How do I get something more like:

 Array(Array(0), Array(20), Array(40), Array(60), Array(80))

Thanks

Re: coalescing RDD into equally sized partitions

Matei Zaharia
This happened because they were all integers equal to 0 mod 5, and we used the default hashCode implementation for integers, which is just the value itself, so value mod 5 maps them all to partition 0. There's no API method that will look at the resulting partition sizes and rebalance them, but you could use another hash function.

Matei
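
One way to act on "use another hash function" is to key each element by something that does spread evenly, such as its position in the RDD, and partition on that key explicitly. A minimal sketch, assuming a Spark version that has RDD.zipWithIndex (the variable names here are just illustrative):

    import org.apache.spark.HashPartitioner

    val rdd = sc.parallelize(Array.tabulate(100)(i => i)).filter(_ % 20 == 0)

    val balanced = rdd
      .zipWithIndex()                      // (value, index)
      .map(_.swap)                         // (index, value): the index becomes the key
      .partitionBy(new HashPartitioner(5)) // consecutive indices hash to different partitions
      .values                              // drop the index again

    balanced.glom().collect()
    // should give Array(Array(0), Array(20), Array(40), Array(60), Array(80))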

Re: coalescing RDD into equally sized partitions

Walrus theCat
Oh... so if I had something more reasonable, like RDDs full of tuples of, say, (Int, Set, Set), I could expect a more uniform distribution?

Thanks


Re: coalescing RDD into equally sized partitions

Walrus theCat
For the record, I tried this, and it worked.
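
For anyone reproducing this, a rough sketch of the tuple case (the sample data below is made up for illustration, and exactly where each tuple lands depends on how the Spark version at hand distributes elements during a shuffled coalesce):

    // Five tuples of (Int, Set[Int], Set[Int]); Scala's tuple hashCode mixes
    // all three fields, so they are unlikely to collapse onto one partition
    // the way the plain multiples of 20 did.
    val tuples = sc.parallelize(Seq(
      (0, Set(1, 2), Set(3)),
      (20, Set(4), Set(5, 6)),
      (40, Set(7), Set(8)),
      (60, Set(9, 10), Set(11)),
      (80, Set(12), Set(13))
    ))

    // Count how many elements end up in each of the 5 partitions.
    tuples.coalesce(5, shuffle = true).glom().map(_.length).collect()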

