How does partitioning happen for binary files in Spark?


ashwini anand
Looking into the source code, I found that for textFile(), partitioning is computed by the computeSplitSize() function in the FileInputFormat class, which takes into account the minPartitions value passed by the user. As I understand it, the equivalent for binaryFiles() is computed by the setMinPartitions() function of the PortableDataStream class, but that function completely ignores the minPartitions value passed by the user. However, in my application the number of partitions still varies with the minPartitions value for binaryFiles() as well, and I have no idea how this is happening. Please help me understand how partitioning happens for binaryFiles().

The source code for setMinPartitions() is as below:

    def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
      val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
      val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
      val defaultParallelism = sc.defaultParallelism
      val files = listStatus(context).asScala
      val totalBytes = files.filterNot(_.isDirectory).map(_.getLen + openCostInBytes).sum
      val bytesPerCore = totalBytes / defaultParallelism
      val maxSplitSize = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
      super.setMaxSplitSize(maxSplitSize)
    }
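For what it's worth, the arithmetic in that method can be checked in isolation. Below is a minimal, Spark-free sketch of the same computation; the concrete sizes (128 MB max partition bytes, 4 MB open cost, file sizes, core count) are assumed example values, not anything read from a real SparkContext:

    // Spark-free sketch of the maxSplitSize arithmetic from setMinPartitions().
    object SplitSizeSketch {
      def maxSplitSize(defaultMaxSplitBytes: Long,
                       openCostInBytes: Long,
                       fileSizes: Seq[Long],
                       defaultParallelism: Int): Long = {
        // Every file is padded with openCostInBytes, mirroring the sum above
        val totalBytes = fileSizes.map(_ + openCostInBytes).sum
        val bytesPerCore = totalBytes / defaultParallelism
        // Clamp between the per-file open cost and the configured maximum
        math.min(defaultMaxSplitBytes, math.max(openCostInBytes, bytesPerCore))
      }

      def main(args: Array[String]): Unit = {
        val mb = 1024L * 1024L
        // Assumed example: ten 64 MB files, 8 cores
        // totalBytes = 10 * (64 + 4) MB = 680 MB; bytesPerCore = 85 MB
        val size = maxSplitSize(128 * mb, 4 * mb, Seq.fill(10)(64 * mb), 8)
        println(size / mb) // 85, since 85 MB is under the 128 MB cap
      }
    }

Note that minPartitions never appears in this computation: the split size depends only on the config values, the total input bytes, and defaultParallelism, which is what makes the behavior you observed surprising.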