That is a good idea, though I am not sure how much it will help, since the time to rsync also depends on the amount of data being copied. The other problem is that sometimes we have dependencies across packages, so the first needs to be running before the second can start, and so on.
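To make the dependency point concrete, here is a rough sketch of how packages could be grouped into stages, so that everything within a stage is safe to copy together and a stage starts only after the ones before it are up. The package names and the dependency map are hypothetical and not taken from spark-ec2; the sketch assumes Python 3.9+ for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def deployment_stages(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group packages into stages; each stage depends only on earlier stages.

    `depends_on` maps a package name to the set of packages that must be
    running before it can start (hypothetical input, for illustration only).
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    stages: List[List[str]] = []
    while sorter.is_active():
        # Every package whose dependencies are all marked done is ready now.
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


# Hypothetical example: pkg-b needs pkg-a, pkg-c needs pkg-b, pkg-d is independent.
example = {"pkg-b": {"pkg-a"}, "pkg-c": {"pkg-b"}, "pkg-d": set()}
print(deployment_stages(example))
```

Each stage could then be deployed in one batch, which keeps the ordering constraint while still cutting down the number of separate copies.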
However, I agree that it takes too long to launch, say, a 100-node cluster right now. If you want to take a shot at trying out some changes, you can fork the spark-ec2 repo at https://github.com/mesos/spark-ec2/tree/v2 and modify the number of rsync calls (each call to /root/spark-ec2/copy-dir currently launches an rsync).
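As a starting point for that kind of experiment, here is a minimal sketch of batching several directories into a single rsync invocation per slave. It only builds the command lines and does not run them. The function name, directory paths, and host names are hypothetical, and this is not how copy-dir is currently written. It relies on two standard rsync features: multiple source arguments in one call, and -R (--relative), which recreates each full source path under the destination root.

```python
from typing import List


def batched_rsync_commands(dirs: List[str], slaves: List[str]) -> List[List[str]]:
    """Build one rsync command per slave that copies every directory in `dirs`.

    `dirs` are absolute paths on the master; `slaves` are host names.
    Both are placeholders for illustration.
    """
    commands = []
    for slave in slaves:
        # -a: archive mode, -z: compress in transit,
        # -R: keep the full source path, so /root/foo lands at /root/foo on the slave.
        cmd = ["rsync", "-azR"] + list(dirs) + [f"{slave}:/"]
        commands.append(cmd)
    return commands


# Hypothetical usage: two directories, two slaves -> two rsync calls instead of four.
for cmd in batched_rsync_commands(["/root/app-one", "/root/app-two"],
                                  ["slave-1", "slave-2"]):
    print(" ".join(cmd))
```

This reduces the number of rsync and ssh startups, but the total amount of data transferred stays the same, so the gain depends on how much of the setup time is per-call overhead.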
On Sun, Mar 30, 2014 at 3:12 PM, Aureliano Buendia <[hidden email]> wrote:
Spark-ec2 uses rsync to deploy many applications. It seems that over time more and more applications have been added to the script, which has significantly slowed down the setup time.
Perhaps the script could be restructured this way: instead of running rsync once per application (N times in total), we could have a single rsync that deploys all N applications.
This should noticeably speed up the setup step, especially for clusters with many nodes.