Configuring distributed caching with Spark and YARN

Configuring distributed caching with Spark and YARN

Paul Schooss
Hello Folks, 

I was wondering if anyone was able to successfully set up distributed caching of jar files using CDH 5/YARN/Spark? I cannot seem to get my cluster working in that fashion.
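(For context: the baseline way to ship jars to YARN containers is spark-submit's --jars option, assuming a Spark version that includes spark-submit. This distributes the jars with each application but does not cache them across applications; the paths and class name below are hypothetical.)

```shell
# Ship extra jars to every YARN container for this application.
# Jars listed under --jars are uploaded and localized per application,
# i.e. re-copied on every submit -- distribution, not caching.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --jars /opt/libs/dep1.jar,/opt/libs/dep2.jar \
  myapp.jar
```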


Regards, 

Paul Schooss
Re: Configuring distributed caching with Spark and YARN

santhoma
Curious to know, were you able to do distributed caching for Spark?

I have done that for Hadoop and Pig, but could not find a way to do it in Spark.
Re: Configuring distributed caching with Spark and YARN

Mayur Rustagi
Is this equivalent to addJar?


Mayur Rustagi
Ph: +1 (760) 203 3257


On Thu, Mar 27, 2014 at 3:58 AM, santhoma <[hidden email]> wrote:
Curious to know, were you able to do distributed caching for spark?

I have done that for hadoop and pig, but could not find a way to do it in spark




Re: Configuring distributed caching with Spark and YARN

santhoma
I think with addJar() there is no 'caching', in the sense that the files are copied every time, once per job.
Whereas with Hadoop's distributed cache, files are copied only once, and a symlink to the cached file is created for subsequent runs:
https://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/filecache/DistributedCache.html
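(For comparison, the Hadoop distributed cache is commonly driven through the GenericOptionsParser flags rather than the DistributedCache API directly; the job class and paths below are hypothetical.)

```shell
# -files places a file in the distributed cache; the '#' fragment
# creates a symlink named 'lookup.dat' in each task's working directory.
# -libjars adds jars to the task classpath via the cache, and
# -archives localizes and unpacks an archive automatically.
hadoop jar myjob.jar com.example.MyJob \
  -files hdfs:///shared/lookup.dat#lookup.dat \
  -libjars hdfs:///shared/dep.jar \
  -archives hdfs:///shared/extras.zip#extras \
  input/ output/
```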

Also, Hadoop's distributed cache can copy an archive file to the node and unpack it automatically into the current working directory. The advantage here is that the copying will be very fast.

Still looking for a similar mechanism in Spark.
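(One mechanism that gets close in later Spark releases, per the "Running on YARN" docs for Spark 2.0+: stage the jars on HDFS once and point spark.yarn.jars or spark.yarn.archive at them, so YARN's node-local cache can reuse the localized copies across applications. The HDFS paths below are hypothetical.)

```shell
# One-time step: stage the Spark jars (and shared dependencies) on HDFS.
hdfs dfs -mkdir -p /spark/jars
hdfs dfs -put "$SPARK_HOME"/jars/*.jar /spark/jars/

# At submit time, reference the staged jars instead of uploading them.
# YARN localizes them once per node and reuses the cached copies for
# subsequent applications. --archives unpacks a zip into the working
# dir via a symlink, analogous to Hadoop's distributed-cache archives.
spark-submit \
  --master yarn \
  --conf spark.yarn.jars='hdfs:///spark/jars/*.jar' \
  --archives hdfs:///shared/extras.zip#extras \
  myapp.jar
```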