How do you run your spark app?


How do you run your spark app?

ldmtwo
I want to ask this not because I can't read the endless documentation and several tutorials, but because there seem to be many ways of doing things and I keep having issues. How do you run your Spark app?

I had it working when I was only using YARN + Hadoop 1 (Cloudera), then I had to get Spark and Shark working and ended up upgrading everything and dropping CDH support. Anyway, this is what I used, with master=yarn-client and app_jar being Scala code compiled with Maven:

java -cp $CLASSPATH -Dspark.jars=$APP_JAR -Dspark.master=$MASTER $CLASSNAME $ARGS

Do you use this, or something else? I could never figure out this method:
SPARK_HOME/bin/spark jar APP_JAR ARGS

For example:
bin/spark-class jar /usr/lib/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 10 10

Do you use SBT or Maven to compile, or something else?


** It seems that I can't get subscribed to the mailing list; I tried both my work and personal email addresses.

Re: How do you run your spark app?

Evan R. Sparks
I use SBT, create an assembly, and then add the assembly JARs when I create my Spark context. I run the main driver with something like "java -cp ... MyDriver".

That said, as of Spark 1.0 the preferred way to run Spark applications is via spark-submit: http://spark.apache.org/docs/latest/submitting-applications.html
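
For reference, a minimal spark-submit invocation looks roughly like the sketch below; the class name, master URL, and jar path are placeholders for illustration, not taken from this thread:

    $SPARK_HOME/bin/spark-submit \
      --class com.example.MyDriver \
      --master yarn-client \
      --executor-memory 2g \
      /path/to/myapp-assembly-1.0.jar arg1 arg2

spark-submit takes care of shipping the application jar and setting the master, so the older "java -cp ... -Dspark.jars=..." style of launching is generally no longer needed.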




Re: How do you run your spark app?

cotdp
When you start seriously using Spark in production, there are basically two things everyone eventually needs:
  1. Scheduled Jobs - recurring hourly/daily/weekly jobs.
  2. Always-On Jobs - jobs that require monitoring, restarting, etc.
There are lots of ways to meet these requirements, everything from crontab through to workflow managers like Oozie.

We opted for the following stack:
  • Marathon - an init/control system for starting, stopping, and maintaining always-on applications.
  • Chronos - a general-purpose scheduler for Mesos that supports job dependency graphs.
  • Spark Job Server - primarily for its ability to reuse shared contexts across multiple jobs.
The majority of our jobs are periodic (batch) jobs run through spark-submit, and we have several always-on Spark Streaming jobs (also run through spark-submit).

We always use "client mode" with spark-submit because the Mesos cluster has direct connectivity to the Spark cluster, and it means all of the Spark stdout/stderr is captured in the Mesos logs, which helps when diagnosing problems.

I thoroughly recommend exploring Mesos/Marathon/Chronos to run Spark and manage your jobs; the Mesosphere tutorials are excellent and you can be up and running in minutes. The web UIs for both make it easy to get started without touching the REST APIs.
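
To make that concrete, here is a rough sketch of the kind of command a Marathon (always-on) or Chronos (scheduled) job ends up running; the Spark master URL, class name, and paths are hypothetical and not taken from our setup:

    #!/bin/bash
    # Hypothetical wrapper script launched by Marathon or Chronos.
    # Client mode keeps the driver's stdout/stderr in the Mesos task logs.
    exec /opt/spark/bin/spark-submit \
      --master spark://spark-master:7077 \
      --deploy-mode client \
      --class com.example.StreamingJob \
      /opt/jobs/streaming-assembly-1.0.jar

Marathon restarts the task if it ever exits, which covers the always-on case; Chronos runs the same kind of command on a schedule.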

Best,

Michael







Re: How do you run your spark app?

cotdp
P.S. Last but not least, we use sbt-assembly to build fat JARs and package dist-style TAR.GZ archives containing launch scripts, JARs, and everything needed to run a job. These are built automatically from source by our Jenkins and stored in HDFS. Our Chronos/Marathon jobs fetch the latest release TAR.GZ directly from HDFS, unpack it, and launch the appropriate script.

Packaging everything required in one go makes for much cleaner development, testing, and deployment than relying on cluster-specific classpath additions or add-jars functionality.
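
To make the packaging idea concrete, a job wrapper of this kind might look something like the sketch below; the HDFS path, archive name, and script name are made up for illustration:

    #!/bin/bash
    # Hypothetical wrapper run by a Chronos/Marathon job:
    # fetch the packaged release from HDFS, unpack it, and hand off to its launch script.
    set -e
    WORKDIR=$(mktemp -d)
    hdfs dfs -get hdfs:///releases/myjob-latest.tar.gz "$WORKDIR/release.tar.gz"
    tar -xzf "$WORKDIR/release.tar.gz" -C "$WORKDIR"
    exec "$WORKDIR/bin/run-job.sh" "$@"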






Re: How do you run your spark app?

Sonal Goyal
We use Maven to build our code and then invoke spark-submit through the exec plugin, passing in our parameters. Works well for us.
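
As a rough sketch only (the goal and property names should be checked against the exec-maven-plugin documentation, and the class and jar names are placeholders), this amounts to something like:

    mvn package exec:exec \
      -Dexec.executable="$SPARK_HOME/bin/spark-submit" \
      -Dexec.args="--class com.example.MyDriver --master yarn-client target/myapp-1.0.jar"

In practice the executable and its arguments usually live in the plugin's configuration section of the pom.xml rather than on the command line.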

Best Regards,
Sonal
Nube Technologies 











Re: How do you run your spark app?

Shivani Rao
Hello Michael,

I have a quick question for you. Can you clarify the statement "build fat JARs and build dist-style TAR.GZ packages with launch scripts, JARs and everything needed to run a job"? Can you give an example?

I am using sbt assembly as well to create a fat jar, and I supply the Spark and Hadoop locations on the classpath. Inside the main() function where the Spark context is created, I use SparkContext.jarOfClass(this).toList to add the fat jar to my Spark context. However, I seem to be running into issues with this approach. I was wondering if you had any inputs, Michael.

Thanks,
Shivani










Re: How do you run your spark app?

Shrikar archak
Hi Shivani,

I use sbt assembly to create a fat jar.

An example sbt file is below.

import AssemblyKeys._ // put this at the top of the file

assemblySettings

mainClass in assembly := Some("FifaSparkStreaming")

name := "FifaSparkStreaming"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq("org.apache.spark" %% "spark-core" % "1.0.0" % "provided",
                            "org.apache.spark" %% "spark-streaming" % "1.0.0" % "provided",
                            ("org.apache.spark" %% "spark-streaming-twitter" % "1.0.0").exclude("org.eclipse.jetty.orbit","javax.transaction")
                                                                                       .exclude("org.eclipse.jetty.orbit","javax.servlet")
                                                                                       .exclude("org.eclipse.jetty.orbit","javax.mail.glassfish")
                                                                                       .exclude("org.eclipse.jetty.orbit","javax.activation")
                                                                                       .exclude("com.esotericsoftware.minlog", "minlog"),
                            ("net.debasishg" % "redisclient_2.10" % "2.12").exclude("com.typesafe.akka","akka-actor_2.10"))

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
    case PathList("org", "apache", xs @ _*) => MergeStrategy.first
    case "application.conf" => MergeStrategy.concat
    case "unwanted.txt"     => MergeStrategy.discard
    case x => old(x)
  }
}


resolvers += "Akka Repository" at "http://repo.akka.io/releases/"


And I run as mentioned below.

LOCALLY:
1)  sbt 'run AP1z4IYraYm5fqWhITWArY53x Cyyz3Zr67tVK46G8dus5tSbc83KQOdtMDgYoQ5WLQwH0mTWzB6 115254720-OfJ4yFsUU6C6vBkEOMDlBlkIgslPleFjPwNcxHjN Qd76y2izncM7fGGYqU1VXYTxg1eseNuzcdZKm2QJyK8d1 fifa fifa2014'

If you want to submit on the cluster

CLUSTER:
2) spark-submit --class FifaSparkStreaming --master "spark://server-8-144:7077" --driver-memory 2048 --deploy-mode cluster FifaSparkStreaming-assembly-1.0.jar AP1z4IYraYm5fqWhITWArY53x Cyyz3Zr67tVK46G8dus5tSbc83KQOdtMDgYoQ5WLQwH0mTWzB6 115254720-OfJ4yFsUU6C6vBkEOMDlBlkIgslPleFjPwNcxHjN Qd76y2izncM7fGGYqU1VXYTxg1eseNuzcdZKm2QJyK8d1 fifa fifa2014


Hope this helps.

Thanks,
Shrikar











Re: How do you run your spark app?

Shivani Rao
Hello Shrikar,

Thanks for your email. I have been using the same workflow as you. But my question is about the creation of the SparkContext. My question is:

If I am specifying jars with "java -cp <jar-paths>" and adding them to my build.sbt, do I need to additionally add them in my code when creating the SparkContext (sparkContext.setJars(...))?


Thanks,
Shivani



Re: How do you run your spark app?

Andrei
Hi Shivani,

Adding JARs to the classpath (e.g. via the "-cp" option) is needed to run your _local_ Java application, whatever it is. To deliver them to the _other machines_ that execute your job, you need to add them to the SparkContext, and you can do that in two different ways:

1. Add them right from your code (your suggested "sparkContext.setJars(...)").
2. Use "spark-submit" and pass the JARs on the command line.

Note that both options are easier if you assemble your code and all of its dependencies into a single "fat" JAR instead of manually listing every needed library.
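
To make the two options concrete (the paths and class name below are placeholders): option 1 is a single call when you build the context, e.g. new SparkConf().setJars(Seq("target/scala-2.10/myapp-assembly-1.0.jar")); option 2 hands the same assembly to spark-submit and lets it do the distribution:

    $SPARK_HOME/bin/spark-submit \
      --class com.example.MyDriver \
      --master spark://master:7077 \
      target/scala-2.10/myapp-assembly-1.0.jar

spark-submit also has a --jars flag for any extra jars you deliberately keep out of the assembly.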






Re: How do you run your spark app?

maasg
Hi Michael,

+1 on the deployment stack. (Almost) the same thing here.
One question: are you deploying the JobServer on Mesos? Through Marathon?
I've been working on solving some of the port-assignment issues on Mesos, but I'm not there yet. Did you guys solve that?

-kr, Gerard.




