Data locality during Spark RDD creation

Debasish Das
Hi,

I have HDFS and MapReduce running on 20 nodes, and an experimental Spark cluster running on a subset of the HDFS nodes (say 8 of them).

If some ETL is done using MR, the data will most likely be spread across all 20 nodes (assuming I used all of them).

Is it a good idea to run the Spark cluster on all 20 nodes where HDFS is running, so that all the RDDs are data-local and bulk data transfer is minimized?

Thanks.
Deb
Re: Data locality during Spark RDD creation

Andrew Ash
I definitely think so.  Network transfer is often a bottleneck for distributed jobs, especially if you're using groupBys or re-keying things often.

What network speed do you have between the HDFS nodes?  1 Gbps?
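Andrew's point can be made concrete with a small back-of-the-envelope sketch (plain Python, no Spark dependency). It estimates what fraction of HDFS blocks a Spark job could read node-locally, given which hosts run Spark executors. All hostnames, the replication factor of 3, and the random block placements are illustrative assumptions, not measurements from Deb's cluster.

```python
import random

def node_local_fraction(block_replicas, spark_hosts):
    """block_replicas: list of sets of hostnames holding each block's replicas.
    spark_hosts: set of hostnames running Spark executors.
    Returns the fraction of blocks with at least one replica on a Spark host,
    i.e. blocks that a task could read without going over the network."""
    local = sum(1 for replicas in block_replicas if replicas & spark_hosts)
    return local / len(block_replicas)

# Hypothetical 20-node HDFS cluster, replication factor 3,
# replicas placed uniformly at random for the sake of the sketch.
hdfs_nodes = [f"node{i:02d}" for i in range(20)]
random.seed(0)
blocks = [set(random.sample(hdfs_nodes, 3)) for _ in range(1000)]

# Spark on only 8 of the 20 nodes vs. Spark on all 20.
eight = set(hdfs_nodes[:8])
all20 = set(hdfs_nodes)
print(node_local_fraction(blocks, eight))  # typically below 1.0
print(node_local_fraction(blocks, all20))  # 1.0 -- every block has a local replica
```

With Spark on all 20 HDFS nodes every block has a local replica by construction, which is exactly why colocating the Spark workers with HDFS avoids the bulk transfer Deb is asking about.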


On Fri, Jan 3, 2014 at 2:34 PM, Debasish Das <[hidden email]> wrote: