Spark context jar confusions


Spark context jar confusions

Aureliano Buendia
Hi,

I do not understand why the Spark context has an option for loading jars at runtime.

As an example, consider this:

object BroadcastTest {
  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "Broadcast Test",
      System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))
  }
}


This is the example, or the application, that we want to run, so what is SPARK_EXAMPLES_JAR supposed to be?
In this particular case, the BroadcastTest example is self-contained; why would it want to load other, unrelated example jars?

Finally, how does this help a real-world Spark application?

Re: Spark context jar confusions

Eugen Cepoi
Hi,

This is the list of jars you use in your job; the driver will send all of those jars to each worker (otherwise the workers won't have the classes your job needs). The easy way to go is to build a fat jar with your code and all the libraries you depend on, and then use this utility to get its path: SparkContext.jarOfClass(YourJob.getClass)
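For example, a minimal sketch of what that looks like in an application's entry point (MyJob and the "My Job" app name are illustrative, and this assumes the 0.8.x API, where jarOfClass returns a Seq[String]):

import org.apache.spark.SparkContext

object MyJob {
  def main(args: Array[String]) {
    // jarOfClass looks up the jar that contains the given class -- here, the
    // fat jar this job was packaged into (in 0.8.x it returns a Seq[String]).
    val jars = SparkContext.jarOfClass(this.getClass)
    val sc = new SparkContext(args(0), "My Job",
      System.getenv("SPARK_HOME"), jars)

    // ... define RDDs and run the job ...

    sc.stop()
  }
}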




Re: Spark context jar confusions

Aureliano Buendia
I wasn't aware of jarOfClass. I wish there were only one good way of deploying in Spark, instead of many ambiguous methods. (It seems Spark has followed Scala in having more than one way of accomplishing a task, which is part of what makes Scala an overcomplicated language.)

1. Should sbt assembly be used to make the fat jar? If so, which sbt should be used: my local sbt, or $SPARK_HOME/sbt/sbt? Why is it that Spark ships with a separate sbt?

2. Let's say we have the dependencies fat jar that is supposed to be shipped to the workers. Now how do we deploy the main app that is supposed to be executed on the driver? Make another jar out of it? Does sbt assembly also create that jar?

3. Is calling sc.jarOfClass() the most common way of doing this? I cannot find any examples by googling. What is the most common approach that people use?






Re: Spark context jar confusions

Eugen Cepoi
It depends on how you deploy; I don't find it so complicated...

1) To build the fat jar I am using Maven (as I am not familiar with sbt).

Inside, I have something like this, saying which libraries should go into the fat jar (the others won't be present in the final artifact):

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.1</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <minimizeJar>true</minimizeJar>
                <createDependencyReducedPom>false</createDependencyReducedPom>
                <artifactSet>
                    <includes>
                        <include>org.apache.hbase:*</include>
                        <include>org.apache.hadoop:*</include>
                        <include>com.typesafe:config</include>
                        <include>org.apache.avro:*</include>
                        <include>joda-time:*</include>
                        <include>org.joda:*</include>
                    </includes>
                </artifactSet>
                <filters>
                    <filter>
                        <artifact>*:*</artifact>
                        <excludes>
                            <exclude>META-INF/*.SF</exclude>
                            <exclude>META-INF/*.DSA</exclude>
                            <exclude>META-INF/*.RSA</exclude>
                        </excludes>
                    </filter>
                </filters>
            </configuration>
        </execution>
    </executions>
</plugin>


2) The app is the jar you have built, so you ship it to the driver node (how depends a lot on how you are planning to use it: Debian packaging, a plain old scp, etc.). To run it you can do something like:

SPARK_CLASSPATH=PathToYour.jar $SPARK_HOME/spark-class com.myproject.MyJob

where MyJob is the entry point to your job; it defines a main method.

3) I don't know what the "common way" is, but I do things this way: build the fat jar, provide some launch scripts, do the Debian packaging, ship it to a node that plays the role of the driver, and run it over Mesos using the launch scripts plus some configuration.





Re: Spark context jar confusions

Aureliano Buendia
When developing the Spark application, do you use "localhost" or "spark://localhost:7077" as the Spark context master?

Using "spark://localhost:7077" is a good way to simulate the production driver, and it provides the web UI. When using "spark://localhost:7077", is it required to create the fat jar? Wouldn't that significantly slow down the development cycle?





Re: Spark context jar confusions

Archit Thakur
Aureliano, it doesn't matter, actually. Specifying "local" as your Spark master only means that the whole application runs in a single JVM. Making a cluster and then specifying "spark://localhost:7077" runs it on a set of machines. Running Spark in local mode is helpful for debugging purposes, but it will perform much more slowly than a cluster of 3, 4, or n machines. If you do not have a set of machines, you can make your own machine a slave and start both the master and a slave on the same machine. Go through the Apache Spark home page to learn more about starting the various nodes. Thx.
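To illustrate, a rough sketch: the only thing that changes is the master URL, and the host/port here assume a standalone master started locally on the default port.

import org.apache.spark.SparkContext

object MasterExample {
  def main(args: Array[String]) {
    // Debugging: everything runs inside this single JVM.
    // val sc = new SparkContext("local", "Master Example")

    // Standalone cluster: master and worker(s) started on this machine,
    // master listening on the default port 7077. The fat jar is shipped
    // to the workers via the jars argument.
    val sc = new SparkContext("spark://localhost:7077", "Master Example",
      System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass))

    println(sc.parallelize(1 to 100).count())
    sc.stop()
  }
}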






Re: Spark context jar confusions

Archit Thakur
Eugen, you said Spark sends the jar to each worker if we specify it. What if we only create a fat jar and do not call sc.jarOfClass(class)? If we have created a fat jar, won't all of its classes be available on the slave nodes? What if we access one in code that is supposed to be executed on one of the slave nodes? E.g., object Z, which is present in the fat jar and is accessed in the map function (which is executed distributedly). Won't it be accessible (since it is there at compile time)? It usually is, isn't it?





Re: Spark context jar confusions

Eugen Cepoi
In reply to this post by Aureliano Buendia
When developing I am using local[2], which runs a local cluster with 2 worker threads. In most cases it is fine; I just encountered some strange behaviour with broadcast variables: in local mode no broadcast is done (at least in 0.8). You also have access to the UI in that case, at localhost:4040.

In dev mode I am directly launching my main class from IntelliJ, so no, I don't need to build the fat jar.
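Something like this minimal sketch, assuming only spark-core is on the IDE classpath (the names are made up):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // pair-RDD implicits needed for reduceByKey in pre-1.x Spark

object DevRun {
  def main(args: Array[String]) {
    // "local[2]" runs the job in this JVM with two worker threads;
    // nothing is shipped anywhere, so no fat jar is needed.
    val sc = new SparkContext("local[2]", "Dev Run")
    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
    counts.foreach(println)
    sc.stop()
  }
}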





Re: Spark context jar confusions

Eugen Cepoi
In reply to this post by Archit Thakur
Spark will send the closures to the workers. If you don't have any external dependencies in your closures (using only Spark types and Scala/Java), then it will work fine. But now suppose you use some classes you have defined in your project, or depend on some common library like Joda-Time. The workers don't know about those classes, so they must be on their classpath. Thus you need to tell the Spark context which jars must be added to the classpath and shipped to the workers. Building a fat jar is just easier than maintaining a list of jars.

To test it you can try with the Spark shell; do something like sc.makeRDD(Seq(DateTime.now(), DateTime.now())).map(date => date.getMillis -> date).collect

When launching the shell, do SPARK_CLASSPATH=path/to/joda-time.jar spark-shell

If you don't do sc.addJar("path/to/joda-time.jar") you will get ClassNotFound exceptions.
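The same idea in a standalone program rather than the shell, as a hedged sketch (the jar path and master URL are placeholders, and it assumes joda-time is also on the driver's own classpath):

import org.apache.spark.SparkContext
import org.joda.time.DateTime

object JodaTest {
  def main(args: Array[String]) {
    val sc = new SparkContext("spark://localhost:7077", "Joda Test")

    // Ship the dependency to the workers; without this (or without listing the
    // jar when constructing the SparkContext) the closure below fails with a
    // ClassNotFoundException for org.joda.time.DateTime.
    sc.addJar("/path/to/joda-time.jar")

    val millis = sc.makeRDD(Seq(DateTime.now(), DateTime.now()))
      .map(date => date.getMillis)
      .collect()
    millis.foreach(println)
    sc.stop()
  }
}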








Re: Spark context jar confusions

Aureliano Buendia
In reply to this post by Eugen Cepoi



On Thu, Jan 2, 2014 at 1:19 PM, Eugen Cepoi <[hidden email]> wrote:
When developing I am using local[2], which runs a local cluster with 2 worker threads. In most cases it is fine; I just encountered some strange behaviour with broadcast variables: in local mode no broadcast is done (at least in 0.8).

That's not good. This could hide bugs in production.
 
You also have access to the UI in that case, at localhost:4040.

That server has a short life; it dies when the program exits.
 

In dev mode I am directly launching my main class from IntelliJ, so no, I don't need to build the fat jar.

Why is it not possible to work with spark://localhost:7077 while developing? That would allow monitoring and reviewing jobs, while keeping a record of past jobs.

I've never been able to connect to spark://localhost:7077 in development; I get:

WARN cluster.ClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

The UI says the workers are alive and they have plenty of memory. I also tried the exact Spark master URL given by the UI, with no luck (apparently Akka is too fragile and sensitive to this). Turning off the firewall on OS X had no effect either.
 





Re: Spark context jar confusions

Eugen Cepoi



2014/1/2 Aureliano Buendia <[hidden email]>



On Thu, Jan 2, 2014 at 1:19 PM, Eugen Cepoi <[hidden email]> wrote:
When developing I am using local[2], which runs a local cluster with 2 worker threads. In most cases it is fine; I just encountered some strange behaviour with broadcast variables: in local mode no broadcast is done (at least in 0.8).

That's not good. This could hide bugs in production.

That depends on what you want to test... Spark is really easy to unit test; IMO, when developing you don't need a full cluster.
 
 
You also have access to the UI in that case, at localhost:4040.

That server has a short life; it dies when the program exits.

Sure, but you are developing at that moment; you want to write unit tests and make sure they pass, no?

 




Re: Spark context jar confusions

Aureliano Buendia
In reply to this post by Eugen Cepoi
Eugen, I noticed that you are including Hadoop in your fat jar:

<include>org.apache.hadoop:*</include>

This would take up a big chunk of the fat jar. Isn't this jar already included in Spark?





Re: Spark context jar confusions

Eugen Cepoi
Indeed, you don't need it; just make sure that it is on your classpath. But anyway the jar is not so big: compared to what your job will do next, sending a few MB over the network seems OK to me.

