[PySpark] - Binary File Partition

Previous Topic Next Topic
classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view

[PySpark] - Binary File Partition

This post has NOT been accepted by the mailing list yet.
This post was updated on .

I am using Spark 1.6.2 and is there a known bug where number of partitions will always be 2 when minPartitions is not specified as below

images = sc.binaryFiles("s3n://********")

I was looking at the source code for PortableDataStream.scala which I believe is used for when we invoke the binary files interface and I see the below code

  def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
    val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
    val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
    val defaultParallelism = sc.defaultParallelism
    val files = listStatus(context).asScala
    val totalBytes = files.filterNot(_.isDirectory).map(_.getLen + openCostInBytes).sum
    val bytesPerCore = totalBytes / defaultParallelism
    val maxSplitSize = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))

Does it mean that minPartitions will no longer be used in the partition determination calculation?

Kindly advice.