number of partitions in join: Spark documentation misleading!

mrm
Hi all,

I was looking for an explanation of the number of partitions of a joined RDD.

The Spark 1.3.1 documentation says:
"For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD."
https://spark.apache.org/docs/latest/configuration.html

And the comment in Partitioner.scala (line 51) states:
"Unless spark.default.parallelism is set, the number of partitions will be the same as the number of partitions in the largest upstream RDD, as this should be least likely to cause out-of-memory errors."

But this is misleading for the Python API: if you do rddA.join(rddB), the number of partitions of the output is the number of partitions of rddA plus the number of partitions of rddB!
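
Here is a minimal sketch of what I mean (the RDD contents and the partition counts 4 and 2 are just example values I picked; I'm running plain PySpark 1.3.1 with spark.default.parallelism unset):

from pyspark import SparkContext

sc = SparkContext(appName="join-partitions")

# Two pair RDDs with different numbers of partitions
rddA = sc.parallelize([(i, i) for i in range(100)], 4)
rddB = sc.parallelize([(i, -i) for i in range(100)], 2)

joined = rddA.join(rddB)

print(rddA.getNumPartitions())    # 4
print(rddB.getNumPartitions())    # 2
# From the docs and the Partitioner.scala comment I would expect max(4, 2) = 4,
# but what I actually see on my setup is 4 + 2 = 6
print(joined.getNumPartitions())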