Using s3 instead of broadcast

Aureliano Buendia
Hi,

My Spark app has to broadcast a 5 GB RDD to about 100 workers at the beginning of each job. Obviously this takes some time, and that time increases linearly with the number of workers.

Does it make sense, instead of broadcasting the 5 GB RDD, to have each worker download it from S3? Download speed from S3 is not supposed to decrease as the number of workers increases.

If downloading from S3 on each worker makes sense, how should it be implemented? The closure code dispatched to workers cannot access the SparkContext object.
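
To make the question concrete, here is a rough sketch of the kind of thing I have in mind, written in Python and not tested on a cluster. The idea is a module-level cache that each worker process fills on first use, so the closure never needs the SparkContext. The bucket name, key, and the helper names (download_from_s3, get_dataset, process_partition) are all made up for illustration, and it assumes boto3 is installed on the workers with credentials available.

```python
import threading

# Per-process cache: filled the first time a task on this worker needs the data.
_cache = {}
_lock = threading.Lock()


def download_from_s3(bucket, key):
    """Fetch one S3 object as bytes (assumes boto3 is installed on the workers)."""
    import boto3  # imported lazily, only where the download actually runs
    return boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()


def get_dataset(bucket, key, loader=download_from_s3):
    """Return the dataset, calling the loader at most once per worker process."""
    with _lock:
        if (bucket, key) not in _cache:
            _cache[(bucket, key)] = loader(bucket, key)
        return _cache[(bucket, key)]


def process_partition(records, bucket="my-bucket", key="data/lookup.bin",
                      loader=download_from_s3):
    """Intended for rdd.mapPartitions; bucket and key are hypothetical placeholders."""
    data = get_dataset(bucket, key, loader)  # resolved on the worker, not the driver
    for record in records:
        # Placeholder computation: pair each record with the dataset size.
        yield record, len(data)
```

On the driver this would be used as rdd.mapPartitions(process_partition). I assume the module has to be shipped to the workers (for example with sc.addPyFile) rather than defined in the driver script, so that the cache is an ordinary module global on each worker, and that the cache only survives between tasks if Spark reuses the Python worker process.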