(Please note that this question was previously posted to https://stackoverflow.com/questions/49943655/spark-schedules-single-task-although-rdd-has-48-partitions.)
We are running Spark 2.3 / Python 3.5.2. For a job we run the following code (please note that the input txt files are just a simplified example; in fact these are large binary files, and sc.binaryFiles(...) runs out of memory loading their content, therefore only the filenames are parallelized and the executors open/read the content themselves):
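A sketch of the driver side; the paths and the run() entry point are placeholder names, not the literal code:

    # driver.py -- sketch with placeholder names
    from pyspark import SparkContext

    from app import run  # app.egg is shipped to the executors via --py-files

    if __name__ == "__main__":
        sc = SparkContext(appName="binary-files-job")
        # Only the file names go through the driver; the content is read
        # on the executors (see the app module below).
        file_names = ["/data/input/part-%05d.bin" % i for i in range(480)]
        result = run(sc, file_names)
        print(len(result))
        sc.stop()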
Here app is a Python module (added to Spark using --py-files app.egg); simplified, its code looks like this:
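(A minimal sketch with illustrative names; read_files, record_size and run are placeholders, and the real per-file work is of course more involved.)

    # app/__init__.py, packaged into app.egg

    def read_files(file_names):
        # Runs on the executors: each task opens and reads its own files
        # from shared storage, so the driver never holds the binary content
        # (this is the workaround for sc.binaryFiles running out of memory).
        for name in file_names:
            with open(name, "rb") as f:
                yield name, f.read()

    def record_size(kv):
        name, content = kv
        return name, len(content)  # placeholder for the real per-file work

    def run(sc, file_names):
        rdd = (sc.parallelize(file_names, 48)  # expect 48 partitions/tasks
                 .mapPartitions(read_files)    # executors read the bytes
                 .map(record_size)
                 .sortByKey())                 # matches the trailing sort stage
        return rdd.collect()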
We notice that the cluster is not fully utilized during the first stages, which we don't understand, and we are looking for ways to control this behavior. This is what the Spark UI shows:
Job 0   Stage 0    1 task    1 min   parallelize
Job 1   Stage 1    1 task    2 min   parallelize
Job 2   Stage 2    1 task    1 min   parallelize
Job 3   Stage 3   48 tasks   5 min   parallelize | mapPartitions | map | mapPartitions | existingRDD | sort
What are the first three jobs? And why isn't there just one job/stage with 48 tasks, as we expected given that the second parameter of parallelize is set to 48?
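For what it's worth, the partition count itself checks out; a quick check in the pyspark shell (file_names as above):

    rdd = sc.parallelize(file_names, 48)
    print(rdd.getNumPartitions())  # prints 48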
Excerpt from the DEBUG logs: