standalone vs YARN


standalone vs YARN

ishaaq
Hi all,
I am evaluating Spark to use here at my work.

We have an existing Hadoop 1.x install which I am planning to upgrade to Hadoop 2.3.

I am trying to work out whether I should install YARN or simply set up a Spark standalone cluster. We already use ZooKeeper, so it isn't a problem to set up HA. I am puzzled, however, as to how the Spark nodes coordinate on data locality - i.e., assuming I install the nodes on the same machines as the DFS data nodes, I don't understand how Spark can work out which nodes should get which splits of the job.
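(For reference, the standalone HA I have in mind is the ZooKeeper recovery mode, configured roughly along these lines in spark-env.sh on each master; the ZooKeeper hostnames are placeholders:)

```shell
# spark-env.sh on each standalone master; zk1..zk3 are placeholder hosts
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"
```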

Anyway, my bigger question remains: YARN or standalone? Which is the more stable option currently? Which is the more future-proof option?

Thanks,
Ishaaq

Re: standalone vs YARN

Prashant Sharma
Hi Ishaaq,

Answers inline, from what I know; I'd be happy to be corrected, though.

On Tue, Apr 15, 2014 at 5:58 PM, ishaaq <[hidden email]> wrote:
Hi all,
I am evaluating Spark to use here at my work.

We have an existing Hadoop 1.x install which I am planning to upgrade to Hadoop
2.3.

This is not really a requirement for Spark; if you are upgrading for some other reason, great!
 
I am trying to work out whether I should install YARN or simply set up a
Spark standalone cluster. We already use ZooKeeper, so it isn't a problem to
set up HA. I am puzzled, however, as to how the Spark nodes coordinate on
data locality - i.e., assuming I install the nodes on the same machines as
the DFS data nodes, I don't understand how Spark can work out which nodes
should get which splits of the job.

This happens exactly the same way Hadoop MapReduce figures out data locality: Spark supports Hadoop's InputFormats, which also carry the information about how the data is partitioned (i.e., which hosts hold each split's blocks). So having the Spark workers share the same nodes as your DFS data nodes is a good idea.
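To illustrate the idea with a toy model (not Spark's actual scheduler): each input split reports the hosts that store its HDFS block, and the scheduler prefers to place the corresponding task on one of those hosts, falling back to any worker when no local one exists.

```python
# Toy model of locality-aware scheduling: each split knows which hosts
# hold its block; the scheduler prefers those hosts when assigning tasks.

def assign_tasks(splits, workers):
    """Assign each split to a worker, preferring hosts that store its block.

    splits  -- list of (split_id, preferred_hosts) pairs
    workers -- list of worker hostnames
    Returns a dict: split_id -> (worker, locality), where locality is
    "NODE_LOCAL" if the chosen worker holds the block, else "ANY".
    """
    assignments = {}
    for split_id, preferred in splits:
        local = [w for w in workers if w in preferred]
        if local:
            # A worker on a host holding the block: run the task there.
            assignments[split_id] = (local[0], "NODE_LOCAL")
        else:
            # No local worker: fall back to any worker in the cluster.
            assignments[split_id] = (workers[0], "ANY")
    return assignments

splits = [
    ("split-0", ["node1", "node2"]),  # block replicated on node1, node2
    ("split-1", ["node3"]),
    ("split-2", ["node9"]),           # block lives on a non-worker host
]
workers = ["node1", "node2", "node3"]
assignments = assign_tasks(splits, workers)
print(assignments)
```

In the real system the preferred hosts come from the InputFormat's split metadata rather than hand-written lists, but the preference order is the same.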
 
Anyway, my bigger question remains: YARN or standalone? Which is the more
stable option currently? Which is the more future-proof option?


Well, I think standalone is stable enough for all purposes, and Spark's YARN support has been keeping up with the latest Hadoop versions too. It comes down to this: if you are already using YARN and don't want the hassle of setting up another cluster manager, you may well prefer YARN.
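Concretely, the choice mostly surfaces as the --master argument when launching an application (shown here with the spark-submit script introduced in Spark 1.0; host and class names are placeholders, and early releases spell the YARN mode as yarn-cluster / yarn-client rather than a separate --deploy-mode flag):

```shell
# Standalone cluster: point --master at the standalone master's URL
spark-submit --master spark://master-host:7077 \
  --class com.example.App app.jar

# YARN: the ResourceManager address is read from HADOOP_CONF_DIR
spark-submit --master yarn-cluster \
  --class com.example.App app.jar
```

The application code itself is the same either way; only the cluster manager behind it changes.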
 
Thanks,
Ishaaq



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/standalone-vs-YARN-tp4271.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: standalone vs YARN

Surendranauth Hiraman
Prashant,

In another email thread several weeks ago, it was mentioned that YARN support is considered beta until Spark 1.0. Is that not the case?

-Suren







--
                                                            
SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: [hidden email]elos.io
W: www.velos.io