GroupBy issue while running K-Means - Dataframe


Deepak Sharma

Hi All,
I have a custom implementation of K-Means that needs the data to be grouped by a key in a DataFrame.
The data is heavily skewed for some of the keys, and grouping them fails with a BufferHolder error:
 Cannot grow BufferHolder by size 17112 because the size after growing exceeds size limitation 2147483632

I tried to work around it by converting the DataFrame to an RDD, using reduceByKey on the RDD, and converting the result back to a DataFrame.
This fails with a Java heap space out-of-memory error.
Since this looks like a common issue, I was wondering how others are solving it.
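
To make the reduceByKey idea concrete, here is a minimal sketch in plain Python (not Spark code). It assumes the per-key work is the K-Means centroid update, which only needs a running vector sum and a count for each key; whether the custom implementation can be expressed this way is an assumption. The names to_partial, merge, reduce_by_key and centroids, and the sample records, are hypothetical. In Spark, the rough counterpart would be mapping each row to a (key, partial) pair and calling reduceByKey with a function like merge.

```python
def to_partial(point):
    """Wrap a single point as a (vector_sum, count) partial aggregate."""
    return (list(point), 1)


def merge(a, b):
    """Combine two partial aggregates. Associative and commutative,
    so the partials can be merged in any order."""
    sum_a, n_a = a
    sum_b, n_b = b
    return ([x + y for x, y in zip(sum_a, sum_b)], n_a + n_b)


def reduce_by_key(records):
    """Fold (key, point) pairs into one fixed-size partial per key,
    without ever building a list of all points for a key."""
    acc = {}
    for key, point in records:
        partial = to_partial(point)
        acc[key] = merge(acc[key], partial) if key in acc else partial
    return acc


def centroids(records):
    """Mean vector per key, computed from the (sum, count) partials."""
    return {
        key: [s / n for s in total]
        for key, (total, n) in reduce_by_key(records).items()
    }


# Hypothetical skewed input: key "a" has far more points than key "b".
records = [("a", (1.0, 2.0))] * 1000 + [("b", (4.0, 6.0)), ("b", (6.0, 8.0))]
result = centroids(records)
print(result)
```

The state kept per key stays the same size no matter how many rows the key has. If the reduce function instead concatenates values into a list, it behaves like a groupBy and the whole skewed key still has to fit in memory at once.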
-- 
Thanks
Deepak