Task splitting among workers

Task splitting among workers

David Thomas
During a Spark stage, how are tasks split among the workers? Specifically, for a HadoopRDD, who determines which worker gets which task?

Re: Task splitting among workers

Patrick Wendell
For a HadoopRDD, the Spark scheduler first calculates the number of tasks based on the input splits. Usually people use this with HDFS data, in which case it's based on HDFS blocks. If the HDFS datanodes are co-located with the Spark cluster, it will try to run each task on the datanode that holds that task's input, to achieve higher throughput. Otherwise, all of the nodes are considered equally fit to run any task, and Spark just load-balances across them.
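
A minimal sketch (not part of the original thread) of the behaviour Patrick describes, assuming a 2014-era Spark API and an HDFS input path; the object name, app name, and path below are placeholders. It shows that a HadoopRDD gets one partition (and hence one task) per input split, and that each partition advertises the hosts holding its split, which the scheduler uses for locality-aware placement.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object InspectSplits {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("inspect-splits"))

    // sc.hadoopFile builds a HadoopRDD; with HDFS input there is roughly one
    // partition (and therefore one task in this stage) per HDFS block / input split.
    val hadoopRdd = sc.hadoopFile(
      "hdfs:///data/example.txt",                  // placeholder path
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text])

    println(s"number of tasks for this stage: ${hadoopRdd.partitions.length}")

    // Each partition reports the hosts holding its split. When HDFS datanodes are
    // co-located with Spark executors, the scheduler tries to run the task on one
    // of these hosts; otherwise it simply load-balances across the cluster.
    hadoopRdd.partitions.foreach { p =>
      println(s"partition ${p.index} prefers: ${hadoopRdd.preferredLocations(p).mkString(", ")}")
    }

    sc.stop()
  }
}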




Re: Task splitting among workers

Arpit Tak-3
1.) What about when the data is in S3 and cached in memory, instead of HDFS?
2.) How is the number of reducers determined in both cases?

Even if I specify set mapred.reduce.tasks=50, only 2 reducers are allocated instead of 50, although the query/tasks complete.

Regards,
Arpit
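
For comparison only (a sketch, not an answer to the mapred.reduce.tasks question above, which is a Hadoop/Hive-style setting): in core Spark the reduce-side task count is taken directly from the numPartitions argument of the shuffle operation, falling back to spark.default.parallelism when it is not given. The sketch assumes a spark-shell session where sc is already provided, and the path is a placeholder.

// Sketch only: requesting reduce parallelism in core Spark. `sc` is the
// SparkContext provided by the spark-shell; in a compiled job against 2014-era
// Spark you would also need `import org.apache.spark.SparkContext._` for the
// pair-RDD operations.
val counts = sc.textFile("hdfs:///data/example.txt")   // placeholder path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 50)                               // explicitly request 50 reduce tasks

println(s"reduce-side partitions: ${counts.partitions.length}")   // prints 50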


 

