flatMap followed by mapPartitions

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

flatMap followed by mapPartitions

Debasish Das
Hi,

I am doing a flatMap followed by mapPartitions to do some blocked operation...flatMap is shuffling data but this shuffle is strictly shuffling to disk and not over the network right ?

Thanks.
Deb
Reply | Threaded
Open this post in threaded view
|

Re: flatMap followed by mapPartitions

Mayur Rustagi
flatmap would have to shuffle data only if output RDD is expected to be partitioned by some key. 
RDD[X].flatmap(X=>RDD[Y])   
If it has to shuffle it should be local. 

Mayur Rustagi
Ph: +1 (760) 203 3257

On Thu, Nov 13, 2014 at 7:31 AM, Debasish Das <[hidden email]> wrote:
Hi,

I am doing a flatMap followed by mapPartitions to do some blocked operation...flatMap is shuffling data but this shuffle is strictly shuffling to disk and not over the network right ?

Thanks.
Deb

Reply | Threaded
Open this post in threaded view
|

Re: flatMap followed by mapPartitions

Debasish Das
mapPartitions tried to hold data is memory which did not work for me..

I am doing flatMap followed by groupByKey now with HashPartitioner and number of blocks is 60 (Based on 120 cores I am running the job on)...

Now when the shuffle size < 100 GB it works fine...as flatMap shuffle goes to 200 GB, 400 GB...I am getting:

FetchFailed(BlockManagerId(1, istgbd013.verizon.com, 44377, 0), shuffleId=37, mapId=8, reduceId=54)

I have to shuffle because the memory on cluster is less than the shuffle size of 400 GB..

The job runs fine if I sample and decrease my shuffle size within 100 GB..

Does groupByKey does a combiner similar to reduceByKey and aggregateByKey ? I need a combiner operation to do some work on map side after flatMap followed by rest of the work on reducers..

On Wed, Nov 12, 2014 at 8:35 PM, Mayur Rustagi <[hidden email]> wrote:
flatmap would have to shuffle data only if output RDD is expected to be partitioned by some key. 
RDD[X].flatmap(X=>RDD[Y])   
If it has to shuffle it should be local. 

Mayur Rustagi
Ph: <a href="tel:%2B1%20%28760%29%20203%203257" value="+17602033257" target="_blank">+1 (760) 203 3257

On Thu, Nov 13, 2014 at 7:31 AM, Debasish Das <[hidden email]> wrote:
Hi,

I am doing a flatMap followed by mapPartitions to do some blocked operation...flatMap is shuffling data but this shuffle is strictly shuffling to disk and not over the network right ?

Thanks.
Deb