Spark-ec2 setup is getting slower and slower

Aureliano Buendia
Hi,

Spark-ec2 uses rsync to deploy many applications. It seems that over time more and more applications have been added to the script, which has significantly slowed down the setup.

Perhaps the script could be restructured this way: instead of running rsync N times, once per application, we could have a single rsync that deploys all N applications.

This should considerably speed up the setup, especially for clusters with many nodes.
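
To make the proposal concrete, here is a minimal sketch in Python that only builds the command lines and does not run them. It assumes plain rsync with `--relative`, which accepts several source paths in one invocation and preserves their full paths on the receiving side. The directory names and the host name `slave1` are hypothetical placeholders, not taken from the spark-ec2 scripts.

```python
import shlex


def per_app_commands(app_dirs, host):
    """One rsync invocation per application directory (N calls)."""
    return [["rsync", "-az", "--relative", d, f"{host}:/"] for d in app_dirs]


def batched_command(app_dirs, host):
    """A single rsync invocation that carries all application directories."""
    return ["rsync", "-az", "--relative", *app_dirs, f"{host}:/"]


if __name__ == "__main__":
    # Hypothetical application directories and host, for illustration only.
    apps = ["/root/app-a", "/root/app-b", "/root/app-c"]
    for cmd in per_app_commands(apps, "slave1"):
        print(shlex.join(cmd))
    print(shlex.join(batched_command(apps, "slave1")))
```

The batched form pays the connection and process start-up cost once per host instead of once per application; the amount of data transferred is the same either way.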

Re: Spark-ec2 setup is getting slower and slower

Shivaram Venkataraman-2
That is a good idea, though I am not sure how much it will help, since the time an rsync takes also depends on the size of the data being copied. The other problem is that we sometimes have dependencies across packages, so the first needs to be running before the second can start, and so on.
However, I agree that it takes too long to launch, say, a 100-node cluster right now. If you want to take a shot at trying out some changes, you can fork the spark-ec2 repo at https://github.com/mesos/spark-ec2/tree/v2 and reduce the number of rsync calls (each call to /root/spark-ec2/copy-dir launches an rsync now).
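
One way to reconcile batching with the dependency problem is to group packages into stages: everything in a stage has its dependencies satisfied by earlier stages, so each stage could be copied with a single rsync and started before the next stage is copied. The sketch below is only an illustration under that assumption. The package names and the dependency map are hypothetical, and it uses `graphlib` from the Python standard library (3.9+), not anything in the spark-ec2 repo.

```python
from graphlib import TopologicalSorter


def rsync_stages(deps):
    """Group packages into stages that respect dependencies.

    `deps` maps each package to the set of packages that must be
    running before it can start. Each returned stage is a sorted list
    of packages that could share one rsync call.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all packages unblocked right now
        stages.append(ready)
        sorter.done(*ready)  # mark them finished to unblock dependents
    return stages


if __name__ == "__main__":
    # Hypothetical dependency map, for illustration only.
    deps = {
        "pkg-a": set(),
        "pkg-b": set(),
        "pkg-c": {"pkg-a", "pkg-b"},
    }
    for number, stage in enumerate(rsync_stages(deps), start=1):
        print(f"stage {number}: {stage}")
```

With a map like this, three per-package rsync calls collapse into two staged calls; the saving grows with the number of packages that have no dependencies on each other.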

Thanks
Shivaram


On Sun, Mar 30, 2014 at 3:12 PM, Aureliano Buendia <[hidden email]> wrote:
Hi,

Spark-ec2 uses rsync to deploy many applications. It seems that over time more and more applications have been added to the script, which has significantly slowed down the setup.

Perhaps the script could be restructured this way: instead of running rsync N times, once per application, we could have a single rsync that deploys all N applications.

This should considerably speed up the setup, especially for clusters with many nodes.