Is uberjar a recommended way of running Spark/Scala applications?

Is uberjar a recommended way of running Spark/Scala applications?

Andrei
I'm using Spark 1.0 and the sbt assembly plugin to create an uberjar of my application. However, when I run the assembly command, I get a number of errors like this:

java.lang.RuntimeException: deduplicate: different file contents found in the following:
/home/username/.ivy2/cache/com.esotericsoftware.kryo/kryo/bundles/kryo-2.21.jar:com/esotericsoftware/minlog/Log$Logger.class
/home/username/.ivy2/cache/com.esotericsoftware.minlog/minlog/jars/minlog-1.2.jar:com/esotericsoftware/minlog/Log$Logger.class
...

As far as I can see, Spark Core depends on both Minlog and Kryo, and the latter bundles the Minlog classes itself. The classes are binary-different, so assembly can't combine them. There are a number of such conflicts - I fixed some of them manually via mergeStrategy, but the list of exceptions keeps growing. I could continue, but it just doesn't look like the right way.
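For illustration, a minimal sketch of the kind of mergeStrategy workaround described above. The key names assume a recent sbt-assembly; the 0.11.x plugin contemporary with Spark 1.0 spelled the key "mergeStrategy in assembly", and the package path is taken from the error trace rather than from a verified build.

// build.sbt - sketch only: resolve the Minlog/Kryo clash by keeping the first copy found.
assembly / assemblyMergeStrategy := {
  // Kryo 2.21 bundles its own copy of com.esotericsoftware.minlog, which clashes with
  // the standalone minlog-1.2 jar; picking the first copy silences the deduplicate error.
  case PathList("com", "esotericsoftware", "minlog", xs @ _*) => MergeStrategy.first
  // Concatenate Typesafe config files instead of choosing one arbitrarily.
  case "reference.conf" => MergeStrategy.concat
  // Everything else falls through to the plugin's default strategy.
  case other =>
    val defaultStrategy = (assembly / assemblyMergeStrategy).value
    defaultStrategy(other)
}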

My questions are: 

1. Is an uberjar a recommended way of running Spark applications?
2. If so, should I include Spark itself in this large jar?
3. If not, what is the recommended way to handle both development and deployment (assuming an ordinary sbt project)?

Thanks, 
Andrei
Re: Is uberjar a recommended way of running Spark/Scala applications?

jaranda
Hi Andrei,

I think the preferred way to deploy Spark jobs is by using the sbt package/run tasks instead of the sbt assembly plugin. In any case, as you comment, the mergeStrategy in combination with some dependency exclusions should fix your problems. Have a look at this gist for further details: https://gist.github.com/JordiAranda/bdbad58d128c14277a05 (I just followed some recommendations from the sbt assembly plugin documentation).
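For illustration, a minimal build.sbt sketch of such a dependency exclusion, assuming you want to drop the standalone minlog jar and keep the copy bundled inside Kryo (the coordinates come from the error trace above; which copy to drop is an assumption):

// build.sbt - sketch: exclude the duplicated artifact instead of merging it.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" excludeAll (
  // Drop the standalone minlog jar and rely on the copy bundled inside Kryo.
  ExclusionRule(organization = "com.esotericsoftware.minlog", name = "minlog")
)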

So far I haven't found a proper way to combine my development and deployment phases, although I must say my experience with Spark is pretty limited (it also depends on your deployment requirements). Someone else could probably give you further insights here.

Best,
Re: Is uberjar a recommended way of running Spark/Scala applications?

Andrei
Thanks, Jordi, your gist looks pretty much like what I have in my project currently (with a few exceptions that I'm going to borrow).

I like the idea of using "sbt package", since it doesn't require third-party plugins and, most importantly, doesn't create a mess of classes and resources. But in this case I'll have to handle the jar list manually via the Spark context. Is there a way to automate this process? E.g., when I was a Clojure guy, I could run "lein deps" (lein is a build tool similar to sbt) to download all dependencies and then just enumerate them from my app. Have you heard of something like that for Spark/SBT?
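One plain-sbt option that comes close to the "lein deps" workflow is setting retrieveManaged := true in build.sbt, which makes sbt copy every resolved dependency jar under lib_managed/. A rough sketch of enumerating those jars and handing them to the driver (paths, app name and directory layout are assumptions, not a verified setup):

import java.io.File
import org.apache.spark.{SparkConf, SparkContext}

object Launcher {
  // Recursively collect .jar files under a directory (lib_managed/ has nested subdirectories).
  private def jarsUnder(dir: File): Seq[File] = {
    val entries = Option(dir.listFiles).map(_.toSeq).getOrElse(Seq.empty)
    entries.filter(_.getName.endsWith(".jar")) ++ entries.filter(_.isDirectory).flatMap(jarsUnder)
  }

  def main(args: Array[String]): Unit = {
    val depJars = jarsUnder(new File("lib_managed")).map(_.getAbsolutePath)
    val appJar  = "target/scala-2.10/myapp_2.10-0.1.jar"   // hypothetical output of `sbt package`

    val conf = new SparkConf()
      .setAppName("myapp")
      .setJars(depJars :+ appJar)   // ship the enumerated jars to the executors
    val sc = new SparkContext(conf)
    // ... job code ...
    sc.stop()
  }
}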

Thanks,
Andrei


Re: Is uberjar a recommended way of running Spark/Scala applications?

Stephen Boesch

The MergeStrategy combined with sbt assembly did work for me. It is not painless: expect some trial and error, and the assembly may take multiple minutes.

You will likely want to filter additional classes out of the generated jar file. Here is an SO answer that explains how, with (IMHO) the best answer's snippet included below - in that case the OP understandably did not want to include javax.servlet.Servlet:

http://stackoverflow.com/questions/7819066/sbt-exclude-class-from-jar

mappings in (Compile,packageBin) ~= { (ms: Seq[(File, String)]) => ms filter { case (file, toPath) => toPath != "javax/servlet/Servlet.class" } }

There is a setting to not include the project files in the assembly but I do not recall it at this moment.
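Possibly the setting being hinted at here, sketched from memory of the plugin's options rather than from a checked build: sbt-assembly exposes flags on assemblyOption (includeBin for the project's own classes, includeScala, includeDependency), and newer versions also ship an assemblyPackageDependency task that builds a jar of dependencies only.

// Sketch only - syntax is for sbt-assembly 1.x; older versions used
// assemblyOption in assembly ~= { _.copy(includeScala = false) } instead.
assembly / assemblyOption := (assembly / assemblyOption).value
  .withIncludeBin(false)         // leave the project's own class files out
  .withIncludeScala(false)       // keep the Scala library out of the fat jar
  .withIncludeDependency(true)   // still bundle third-party dependencies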



Re: Is uberjar a recommended way of running Spark/Scala applications?

Andrei
Thanks, Stephen. I have eventually decided to go with assembly, but to leave the Spark and Hadoop jars out and instead use `spark-submit` to provide these dependencies automatically. This way no resource conflicts arise and mergeStrategy needs no modification. To capture this stable setup and share it with the community, I've put together a project [1] with a minimal working config. It is an SBT project with the assembly plugin, Spark 1.0 and Cloudera's Hadoop client. I hope it helps somebody get their Spark setup going quicker.
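A minimal sketch of the dependency section such a setup might use: Spark and the Hadoop client marked "provided" so they are compiled against but left out of the assembly, with spark-submit supplying them at runtime (the versions and the Cloudera coordinate below are illustrative, not copied from the actual project):

// build.sbt - sketch: compile against Spark/Hadoop but keep them out of the uberjar.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "1.0.0"          % "provided",
  "org.apache.hadoop" %  "hadoop-client" % "2.3.0-cdh5.0.0" % "provided"  // illustrative CDH version
)
// The assembled jar is then launched with something like:
//   spark-submit --class my.package.Main target/scala-2.10/myapp-assembly-0.1.jar
// so spark-submit provides the Spark and Hadoop classes at runtime.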

Though I'm fine with this setup for final builds, I'm still looking for a more interactive dev setup - something that doesn't require a full rebuild.

[1]: https://github.com/faithlessfriend/sample-spark-project
Thanks and have a good weekend,
Andrei

Re: Is uberjar a recommended way of running Spark/Scala applications?

Ngoc Dao
Alternative solution:
https://github.com/xitrum-framework/xitrum-package

It collects all the dependency .jar files in your Scala program into a directory. It doesn't merge the .jar files together; the .jar files are left as is.


Re: Is uberjar a recommended way of running Spark/Scala applications?

Andrei
Thanks! This is even closer to what I am looking for. I'm on a trip now, so I'm going to give it a try when I come back.


Re: Is uberjar a recommended way of running Spark/Scala applications?

Pierre B
You might want to look at another great plugin: "sbt-pack" (https://github.com/xerial/sbt-pack).

It collects all the dependency JARs and creates launch scripts for *nix (including Mac OS) and Windows.
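For completeness, roughly how the plugin is wired in, going by its README from around that time; the coordinates, version and setting names are from memory and may differ for other plugin versions:

// project/plugins.sbt - sketch
addSbtPlugin("org.xerial.sbt" % "sbt-pack" % "0.6.1")

// build.sbt - sketch: `sbt pack` then collects dependency jars under target/pack/lib
// and generates launch scripts under target/pack/bin.
packSettings

packMain := Map("myapp" -> "my.package.Main")   // hypothetical program name -> main class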

HTH

Pierre

