Missing Spark URL after starting the master


Missing Spark URL after starting the master

bin wang
Hi there, 

I have a CDH cluster set up, and I tried the Spark parcel that comes with Cloudera Manager, but it turns out it doesn't even have the run-example script in the bin folder. So I removed it from the cluster, cloned incubator-spark onto the name node of my cluster, and built it from source there successfully with everything left at the defaults.

I ran a few examples and everything seems to work fine in local mode. Now I am thinking about scaling it out to my cluster, which is what the "DISTRIBUTE + ACTIVATE" command does in Cloudera Manager. I want to add all the datanodes as slaves, and I think I should run Spark in standalone mode.

Say I am trying to set up Spark in standalone mode following these instructions: 
They say: "Once started, the master will print out a spark://HOST:PORT URL for itself, which you can use to connect workers to it, or pass as the “master” argument to SparkContext. You can also find this URL on the master’s web UI, which is http://localhost:8080 by default."

After I started the master, though, no URL was printed to the screen, and the web UI is not running either.
Here is the output:
[root@box incubator-spark]# ./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /root/bwang_spark_new/incubator-spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-box.out
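
For reference, the spark:// URL would normally show up in that log file rather than on the console. A rough sketch of how to look for it (log path taken from the output above, web UI port from the docs' default):

grep "spark://" /root/bwang_spark_new/incubator-spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-box.out
# any startup errors (port binding, hostname resolution, ...) land in the same log
tail -n 50 /root/bwang_spark_new/incubator-spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-box.out
# if the master web UI did come up, it answers on port 8080 by default
curl -s http://localhost:8080 | head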

First question: am I even in the ballpark by running Spark in standalone mode if I want to fully utilize my cluster? I saw there are four ways to launch Spark on a cluster: Amazon EC2, standalone mode, Apache Mesos, and Hadoop YARN... I guess standalone mode is the way to go?

Second question: how do I get the Spark URL of the cluster, and why doesn't the output look like what the instructions describe?

Best regards, 

Bin

Re: Missing Spark URL after starting the master

Mayur Rustagi
I think you have been through enough :). 
Basically you just have to download the spark-ec2 scripts and run them. They only need your Amazon access key and secret key; they will start your cluster, install everything, create the security groups, and give you the URL; then just log in and go ahead...
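
Roughly, it looks like the following (a sketch; the key pair name, identity file, slave count, and cluster name are placeholders, and the spark-ec2 script lives in the ec2/ directory of the Spark distribution):

export AWS_ACCESS_KEY_ID=<your access key>
export AWS_SECRET_ACCESS_KEY=<your secret key>
cd incubator-spark/ec2
# launch a cluster with 4 slaves
./spark-ec2 -k my-keypair -i ~/my-keypair.pem -s 4 launch my-spark-cluster
# log in to the master once everything is up
./spark-ec2 -k my-keypair -i ~/my-keypair.pem login my-spark-cluster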

Mayur Rustagi
Ph: +1 (760) 203 3257




Re: Missing Spark URL after starting the master

Ognen Duzlevski
I have a standalone Spark cluster running in an Amazon VPC that I set up by hand. All I did was provision the machines from a common AMI (my underlying distribution is Ubuntu), create a "sparkuser" on each machine, and download Spark into /home/sparkuser/spark. I did the build on the master only: I ran sbt/sbt assembly, set up conf/spark-env.sh to point to the master's IP address (in my case 10.10.0.200, with the default port 7077), and set up the slaves file in the same directory with the 16 IP addresses of the worker nodes (in my case 10.10.0.201-216). Once sbt/sbt assembly finished on the master, I did cd ~/; tar -czf spark.tgz spark/, copied the resulting tgz file to each worker using the same "sparkuser" account, and unpacked the .tgz on each slave. This effectively replicates everything from the master to all slaves; you can script it so you don't have to do it by hand.
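
A condensed sketch of the above, run as "sparkuser" on the master (the IPs are the example addresses from my setup, so adjust to taste; the copy loop assumes passwordless ssh between the nodes):

cd ~/spark
sbt/sbt assembly

# conf/spark-env.sh: point everything at the master
echo 'export SPARK_MASTER_IP=10.10.0.200' >> conf/spark-env.sh
echo 'export SPARK_MASTER_PORT=7077' >> conf/spark-env.sh

# conf/slaves: one worker IP per line (10.10.0.201 through 10.10.0.216)
for i in $(seq 201 216); do echo "10.10.0.$i"; done > conf/slaves

# replicate the built tree to every worker
cd ~ && tar -czf spark.tgz spark/
while read host; do
  scp spark.tgz "sparkuser@$host:~/" && ssh "sparkuser@$host" tar -xzf spark.tgz
done < ~/spark/conf/slaves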

Your AMI should have the distribution's versions of Java and git installed, by the way.

All you have to do then is run spark/sbin/start-all.sh as sparkuser on the master (that is for 0.9; in 0.8.1 it is spark/bin/start-all.sh) and it will all automagically start :)

All my Amazon nodes come with 4x400 GB of ephemeral storage, which I have set up as a 1.6 TB RAID0 array on each node. I pool this into an HDFS filesystem operated by a namenode outside the Spark cluster, while the datanodes are the same nodes as the Spark workers. This gives replication and extremely fast access, since ephemeral storage is much faster than EBS or anything else on Amazon (you can do even better with SSD drives in this setup, but it will cost you).
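
For what it's worth, a rough sketch of the RAID0 part on one node (the device names are an assumption, check yours with lsblk; on many EC2 instance types the ephemeral disks show up as /dev/xvdb through /dev/xvde):

# assemble four ephemeral disks into a single RAID0 array
mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/xvdb /dev/xvdc /dev/xvdd /dev/xvde
mkfs.ext4 /dev/md0
mkdir -p /mnt/hdfs-data
mount /dev/md0 /mnt/hdfs-data
# then point the datanode data directory (dfs.datanode.data.dir in hdfs-site.xml,
# dfs.data.dir on older Hadoop) at /mnt/hdfs-data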

If anyone is interested, I can document our pipeline setup. I came up with it myself and have no clue what the industry standards are, since I could not find any written instructions anywhere online on how to set up a whole data analytics pipeline from the point of ingestion to the point of analytics (people don't want to share their secrets? or am I just in the dark and incapable of using Google properly?). My requirement was that this run inside a VPC for added security and simplicity; the Amazon security groups get old really quickly. An added bonus is that you can use a VPN as the entry point into the whole system, and your cluster instantly becomes "local" to you in terms of IPs and so on. I use OpenVPN since I like neither Cisco nor Juniper (the only two options Amazon provides for their VPN gateways).

Ognen

 

-- 
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
-- Jamie Zawinski

Re: Missing Spark URL after starting the master

Ognen Duzlevski
I should add that with this setup you really do not need to look for the printout of the master node's IP; you set it yourself a priori. If anyone is interested, let me know and I can write it all up so that people can follow a set of instructions. Who knows, maybe I can come up with a set of scripts to automate it all...
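
In other words, once SPARK_MASTER_IP and the port are fixed in conf/spark-env.sh, the master URL is known up front. A small sketch with my example address:

MASTER_URL=spark://10.10.0.200:7077
# start a worker by hand against it ...
spark/bin/spark-class org.apache.spark.deploy.worker.Worker "$MASTER_URL"
# ... or point the shell (and through it the SparkContext) at it
MASTER="$MASTER_URL" spark/bin/spark-shell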

Ognen



Re: Missing Spark URL after starting the master

bin wang
Hi Ognen/Mayur, 

Thanks for the replies; it is good to know how easy it is to set up Spark on an AWS cluster. 

My situation is a bit different from yours: our company already has a cluster, and it really doesn't make much sense not to use it. That is why I have been "going through" this. I really wish there were some tutorials on how to set up a Spark cluster on a bare-metal CDH cluster, or some way to tweak the CDH Spark distribution so that it is up to date.

Ognen, it would of course be very helpful if you could 'history | grep spark...' and document the work you have done, since you have already made it work! 

Bin





Re: Missing Spark URL after starting the master

Mayur Rustagi
I have instructions for this on the Cloudera VM: http://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_Cloudera_VM
Which Spark version are you trying to set up on Cloudera, and which Cloudera version are you using?


Mayur Rustagi
Ph: +1 (760) 203 3257





Re: Missing Spark URL after starting the master

bin wang
Hi Mayur, 

I am using CDH4.6.0p0.26, and the latest Cloudera Spark parcel is Spark 0.9.0 CDH4.6.0p0.50. 
As I mentioned, somehow the Cloudera Spark version doesn't contain the run-example shell script. However, it is automatically configured and is pretty easy to set up across the cluster... 

Thanks, 
Bin

