Why is repartitionAndSortWithinPartitions slower than MapReduce?

周浥尘
Hi team,

I found that the Spark method repartitionAndSortWithinPartitions spends twice as much time as the equivalent MapReduce job in some cases.
I want to repartition the dataset according to split keys and save each partition to a file in ascending key order. As the doc says, repartitionAndSortWithinPartitions “is more efficient than calling `repartition` and then sorting within each partition because it can push the sorting down into the shuffle machinery.” I expected it to be faster than MR, but in practice it is much slower. I also adjusted several Spark configurations, but that didn't help. (Both the Spark and MapReduce jobs run on a three-node cluster and use the same number of partitions.)
Can this be explained, or is there any way to improve Spark's performance here?
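
(For reference, a minimal sketch of this pattern; the tab-separated input, key extraction, HashPartitioner, partition count, and paths below are assumptions for illustration, not the actual job:)

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object RepartitionAndSortSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repartition-and-sort"))

    // Assumed: tab-separated text input whose first field is the split key.
    val keyed = sc.textFile("hdfs:///path/to/input")
      .map(line => (line.split("\t")(0), line))

    // Shuffle into 128 partitions (assumed count) and sort each partition by key
    // during the shuffle itself.
    val sorted = keyed.repartitionAndSortWithinPartitions(new HashPartitioner(128))

    // Each output file corresponds to one partition, with records in ascending key order.
    sorted.values.saveAsTextFile("hdfs:///path/to/output")

    sc.stop()
  }
}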

Thanks & Regards,
Yichen

Re: Why is repartitionAndSortWithinPartitions slower than MapReduce?

周浥尘
In addition to my previous email,
Environment: Spark 2.1.2, Hadoop 2.6.0-cdh5.11, Java 1.8, CentOS 6.6

Re: Why is repartitionAndSortWithinPartitions slower than MapReduce?

Koert Kuipers
I assume you are using RDDs? What are you doing after the repartitioning + sorting, if anything?
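
(For comparison, a Dataset-level version of the same pipeline would look roughly like the sketch below; the column names, partition count, and paths are assumptions, not something taken from this thread:)

import org.apache.spark.sql.SparkSession

object DatasetRepartitionSortSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataset-repartition-sort").getOrCreate()
    import spark.implicits._

    // Assumed: tab-separated text input whose first field is the key.
    val df = spark.read.textFile("hdfs:///path/to/input")
      .map(line => (line.split("\t")(0), line))
      .toDF("key", "value")

    // Repartition by key into 128 partitions (assumed), then sort each partition by key;
    // the Dataset plan runs the sort through Spark SQL's internal (Tungsten) sorter.
    df.repartition(128, $"key")
      .sortWithinPartitions($"key")
      .select($"value")
      .write.text("hdfs:///path/to/output")

    spark.stop()
  }
}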

