RangePartitioning skewed data


Let's say I have a dataset of (K, V) pairs where the keys are heavily skewed:

myDataRDD =
[(8, 1), (8, 13), (1,1), (2,4)]
[(8, 12), (8, 15), (8, 7), (8, 6), (8, 4), (8, 3), (8, 4), (10,2)]

If I applied a RangePartitioner to this data, say val rangePart = new RangePartitioner(4, myDataRDD), and then repartitioned the data with it, would I get back 4 evenly sized partitions, with key 8 split across several of them, or would all the records with key 8 end up in one partition?
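
For reference, here is a minimal sketch of how I would check this locally. The local[2] master, the app name, the object name and the two input slices are placeholders I picked for the experiment, not part of my real job. Since Partitioner.getPartition only receives the key, I would expect every record with key 8 to map to the same partition, but the sketch prints the actual layout so that can be confirmed.

```scala
import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}

object RangePartitionCheck {
  def main(args: Array[String]): Unit = {
    // Placeholder settings for a quick local run (assumptions, not my real config)
    val conf = new SparkConf().setMaster("local[2]").setAppName("range-partition-check")
    val sc = new SparkContext(conf)
    try {
      // Same records as in the post, loaded into two input partitions
      val myDataRDD = sc.parallelize(Seq(
        (8, 1), (8, 13), (1, 1), (2, 4),
        (8, 12), (8, 15), (8, 7), (8, 6), (8, 4), (8, 3), (8, 4), (10, 2)
      ), numSlices = 2)

      // Ask for 4 range partitions; the bounds are derived by sampling the keys
      val rangePart = new RangePartitioner(4, myDataRDD)
      val repartitioned = myDataRDD.partitionBy(rangePart)

      // Print the contents of each resulting partition
      repartitioned.glom().collect().zipWithIndex.foreach { case (part, idx) =>
        println(s"partition $idx: ${part.mkString(", ")}")
      }

      // Show which partition each distinct key is assigned to
      Seq(1, 2, 8, 10).foreach { k =>
        println(s"key $k -> partition ${rangePart.getPartition(k)}")
      }
    } finally {
      sc.stop()
    }
  }
}
```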

If this isn't possible, is there some other partitioner that would distribute this dataset evenly? I want an even distribution because I am feeding this RDD into aggregateByKey() and would like to reduce the data skew as the partitions are written out.
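
One workaround I have seen suggested for this kind of skew is key salting, which is a two-stage aggregation rather than a different partitioner. Below is a rough sketch of what I understand that to look like. It assumes the aggregation is a plain sum (my real aggregation may differ, and this only works when partial results can be merged afterwards), and numSalts, the object name and the local settings are hypothetical values chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object SaltedAggregate {
  def main(args: Array[String]): Unit = {
    // Placeholder settings for a quick local run (assumptions, not my real config)
    val conf = new SparkConf().setMaster("local[2]").setAppName("salted-aggregate")
    val sc = new SparkContext(conf)
    try {
      val myDataRDD = sc.parallelize(Seq(
        (8, 1), (8, 13), (1, 1), (2, 4),
        (8, 12), (8, 15), (8, 7), (8, 6), (8, 4), (8, 3), (8, 4), (10, 2)
      ))

      // Hypothetical number of sub-keys to spread each original key over
      val numSalts = 4

      // Stage 1: append a random salt so a hot key becomes several distinct keys,
      // then aggregate per (key, salt) pair. Sum stands in for the real aggregation.
      val partial = myDataRDD
        .map { case (k, v) => ((k, Random.nextInt(numSalts)), v) }
        .aggregateByKey(0)(_ + _, _ + _)

      // Stage 2: drop the salt and merge the partial results per original key
      val totals = partial
        .map { case ((k, _), sum) => (k, sum) }
        .reduceByKey(_ + _)

      totals.collect().sorted.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The second stage only has to merge at most numSalts partial values per key, so the heavy work for key 8 is spread out in the first stage.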

Also, does myDataRDD need to be sorted before the range partitioner is created for it to work correctly? My research suggests this may be the case.