Why spark-submit works with package not with jar

Mich Talebzadeh
Hi,

I have a scenario where I use spark-submit as follows:

spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar,/home/hduser/jars/spark-bigquery_2.11-0.2.6.jar

As you can see, the jar files needed are added.


This comes back with the error message below:


Creating model test.weights_MODEL

java.lang.NoClassDefFoundError: com/google/api/client/http/HttpRequestInitializer

  at com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19)

  at com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19)

  at com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105)

  ... 76 elided

Caused by: java.lang.ClassNotFoundException: com.google.api.client.http.HttpRequestInitializer

  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)

  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

  

So there is an issue finding the class, although the jar file used,


/home/hduser/jars/spark-bigquery_2.11-0.2.6.jar


has it.


Now if I remove the above jar file and replace it with the same version as a package, it works!


spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar --packages com.github.samelamin:spark-bigquery_2.11:0.2.6


I have read the write-ups about packages searching the Maven repositories, etc. I am not convinced why using the package should make the difference between failure and success. In other words, when should one use a package rather than a jar?


Any ideas will be appreciated.


Thanks



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 

Re: Why spark-submit works with package not with jar

srowen
Probably because your JAR file requires other JARs which you didn't supply. If you specify a package, it reads metadata like a pom.xml file to understand what other dependent JARs also need to be loaded.
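For example, you can see what will be pulled in by reading the artifact's published POM. A minimal sketch, assuming the artifact is mirrored under the standard Maven Central directory layout (adjust the repository URL if it is hosted elsewhere, e.g. on spark-packages):

# Sketch: list the dependencies declared in the artifact's POM.
# The Maven Central URL layout below is an assumption, not confirmed in this thread.
curl -s https://repo1.maven.org/maven2/com/github/samelamin/spark-bigquery_2.11/0.2.6/spark-bigquery_2.11-0.2.6.pom \
  | grep -A3 '<dependency>'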


Re: Why spark-submit works with package not with jar

Russell Spitzer
In reply to this post by Mich Talebzadeh
--jars adds only that jar.
--packages adds the jar and its dependencies listed in Maven.
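Illustratively, with the paths and coordinates from this thread (application jar and arguments elided):

# Ships exactly the listed file, nothing else:
spark-submit --jars /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar ...
# Resolves the coordinate plus its transitive dependencies via Ivy:
spark-submit --packages com.github.samelamin:spark-bigquery_2.11:0.2.6 ...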


Re: Why spark-submit works with package not with jar

Mich Talebzadeh
Thanks Sean and Russell. Much appreciated.

Just to clarify: recently I had issues with different versions of Google Guava jar files when building an uber jar file (to evict the unwanted ones). This used to work a year and a half ago on Google Dataproc compute engines (which come with Spark preloaded), and I could create an uber jar file.

Unfortunately this has become problematic now, so I tried to use spark-submit instead as follows:

# note: the --jars value must be a single comma-separated argument (no spaces
# or stray line breaks), every continued line needs a trailing \, and "$@" goes last
${SPARK_HOME}/bin/spark-submit \
                --master yarn \
                --deploy-mode client \
                --conf spark.executor.memoryOverhead=3000 \
                --class org.apache.spark.repl.Main \
                --name "Spark shell on Yarn" \
                --driver-class-path /home/hduser/jars/ddhybrid.jar \
                --jars /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar \
                --packages com.github.samelamin:spark-bigquery_2.11:0.2.6 \
                "$@"

This is effectively a tailored spark-shell. However, I do not think there is a mechanism to resolve jar conflicts without building an uber jar file through sbt?

Cheers




Re: Why spark-submit works with package not with jar

ayan guha
Hi

One way to think of this: --packages is better when you have third-party dependencies, and --jars is better when you have custom in-house built jars.

--
Best Regards,
Ayan Guha

Re: Why spark-submit works with package not with jar

Nicolas Paris-2
Once you have the jars from --packages in the ~/.ivy2 folder, you can then
add the list to --jars. That way there is no missing dependency.
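A minimal sketch of that approach, assuming the default ~/.ivy2/jars cache location (application jar and arguments elided):

# Build a comma-separated list of every jar Ivy downloaded
DEPS=$(ls ~/.ivy2/jars/*.jar | paste -sd, -)
# Pass the whole list explicitly instead of --packages
spark-submit --jars "$DEPS",/home/hduser/jars/ddhybrid.jar ...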


--
nicolas paris


Re: Why spark-submit works with package not with jar

Mich Talebzadeh
Hi Nicolas,

I removed ~/.ivy2 and reran the Spark job with the package included (the working one).

Under ~/.ivy2/jars I have 37 jar files, including the one that I had before.

/home/hduser/.ivy2/jars> ls
com.databricks_spark-avro_2.11-4.0.0.jar                           com.google.cloud.bigdataoss_gcs-connector-1.9.4-hadoop2.jar    com.google.oauth-client_google-oauth-client-1.24.1.jar        org.checkerframework_checker-qual-2.5.2.jar
com.fasterxml.jackson.core_jackson-core-2.9.2.jar                  com.google.cloud.bigdataoss_gcsio-1.9.4.jar                    com.google.oauth-client_google-oauth-client-java6-1.24.1.jar  org.codehaus.jackson_jackson-core-asl-1.9.13.jar
com.github.samelamin_spark-bigquery_2.11-0.2.6.jar                 com.google.cloud.bigdataoss_util-1.9.4.jar                     commons-codec_commons-codec-1.6.jar                           org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar
com.google.api-client_google-api-client-1.24.1.jar                 com.google.cloud.bigdataoss_util-hadoop-1.9.4-hadoop2.jar      commons-logging_commons-logging-1.1.1.jar                     org.codehaus.mojo_animal-sniffer-annotations-1.14.jar
com.google.api-client_google-api-client-jackson2-1.24.1.jar        com.google.code.findbugs_jsr305-3.0.2.jar                      com.thoughtworks.paranamer_paranamer-2.3.jar                  org.slf4j_slf4j-api-1.7.5.jar
com.google.api-client_google-api-client-java6-1.24.1.jar           com.google.errorprone_error_prone_annotations-2.1.3.jar        joda-time_joda-time-2.9.3.jar                                 org.tukaani_xz-1.0.jar
com.google.apis_google-api-services-bigquery-v2-rev398-1.24.1.jar  com.google.guava_guava-26.0-jre.jar                            org.apache.avro_avro-1.7.6.jar                                org.xerial.snappy_snappy-java-1.0.5.jar
com.google.apis_google-api-services-storage-v1-rev135-1.24.1.jar   com.google.http-client_google-http-client-1.24.1.jar           org.apache.commons_commons-compress-1.4.1.jar
com.google.auto.value_auto-value-annotations-1.6.2.jar             com.google.http-client_google-http-client-jackson2-1.24.1.jar  org.apache.httpcomponents_httpclient-4.0.1.jar
com.google.cloud.bigdataoss_bigquery-connector-0.13.4-hadoop2.jar  com.google.j2objc_j2objc-annotations-1.1.jar                   org.apache.httpcomponents_httpcore-4.0.1.jar

I don't think I need to add all of these to the spark-submit --jars list. Is there a way I can find out which dependency is missing?

This is the error I am getting when I use the jar file com.github.samelamin_spark-bigquery_2.11-0.2.6.jar instead of the package com.github.samelamin:spark-bigquery_2.11:0.2.6:

java.lang.NoClassDefFoundError: com/google/api/client/http/HttpRequestInitializer
  at com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19)
  at com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19)
  at com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105)
  ... 76 elided
Caused by: java.lang.ClassNotFoundException: com.google.api.client.http.HttpRequestInitializer
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)


Thanks




Re: Why spark-submit works with package not with jar

Nicolas Paris-2
You can proceed step by step.

> java.lang.NoClassDefFoundError:
> com/google/api/client/http/HttpRequestInitializer

I would run `grep -lRi HttpRequestInitializer` in the .ivy2 folder to
spot the jar containing that class. After chasing down several more
"class not found" errors the same way, you should succeed.
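Since jar entries are usually stored compressed, a plain grep over the jar files can miss the class name; scanning each jar's entry list is more reliable. A sketch, again assuming the default ~/.ivy2/jars location:

for j in ~/.ivy2/jars/*.jar; do
  # print any jar whose entry list contains the missing class
  unzip -l "$j" | grep -qi HttpRequestInitializer && echo "$j"
done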


--
nicolas paris


Re: Why spark-submit works with package not with jar

srowen
In reply to this post by Mich Talebzadeh
From the looks of it, it's the com.google.http-client ones. But there may be more. You should not have to reason about this. That's why you let Maven / Ivy resolution figure it out. It is not true that everything in .ivy2 is on the classpath.
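For reference, the same resolution can also be requested through configuration rather than the command-line flag; spark.jars.packages is the documented configuration equivalent of --packages:

# e.g. in conf/spark-defaults.conf, so every submission resolves the package:
spark.jars.packages  com.github.samelamin:spark-bigquery_2.11:0.2.6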


Re: Why spark-submit works with package not with jar

Mich Talebzadeh
Thanks again all.

Hi Sean,

As I understand it from your statement, you are suggesting we just use --packages without worrying about individual jar dependencies?



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 




Re: Why spark-submit works with package not with jar

Mich Talebzadeh
Or just use mvn or sbt to create an uber jar file.
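A sketch of that route with sbt-assembly (the main class and jar name below are placeholders, and the sbt-assembly plugin is assumed to be configured in project/plugins.sbt):

# Build the uber jar, then submit it on its own; no --jars/--packages needed
sbt clean assembly
spark-submit --class com.example.MyApp target/scala-2.11/myapp-assembly-0.1.jar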




LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 20 Oct 2020 at 22:43, Mich Talebzadeh <[hidden email]> wrote:
Thanks again all.

Hi Sean,

As I understood from your statement, you are suggesting just use --packages without worrying about individual jar dependencies?



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Tue, 20 Oct 2020 at 22:34, Sean Owen <[hidden email]> wrote:
From the looks of it, it's the com.google.http-client ones. But there may be more. You should not have to reason about this. That's why you let Maven / Ivy resolution figure it out. It is not true that everything in .ivy2 is on the classpath.

On Tue, Oct 20, 2020 at 3:48 PM Mich Talebzadeh <[hidden email]> wrote:
Hi Nicolas,

I removed ~/.iv2 and reran the spark job with the package included (the one working)

Under ~/.ivy/jars I Have 37 jar files, including the one that I had before. 

/home/hduser/.ivy2/jars> ls -1
com.databricks_spark-avro_2.11-4.0.0.jar
com.fasterxml.jackson.core_jackson-core-2.9.2.jar
com.github.samelamin_spark-bigquery_2.11-0.2.6.jar
com.google.api-client_google-api-client-1.24.1.jar
com.google.api-client_google-api-client-jackson2-1.24.1.jar
com.google.api-client_google-api-client-java6-1.24.1.jar
com.google.apis_google-api-services-bigquery-v2-rev398-1.24.1.jar
com.google.apis_google-api-services-storage-v1-rev135-1.24.1.jar
com.google.auto.value_auto-value-annotations-1.6.2.jar
com.google.cloud.bigdataoss_bigquery-connector-0.13.4-hadoop2.jar
com.google.cloud.bigdataoss_gcs-connector-1.9.4-hadoop2.jar
com.google.cloud.bigdataoss_gcsio-1.9.4.jar
com.google.cloud.bigdataoss_util-1.9.4.jar
com.google.cloud.bigdataoss_util-hadoop-1.9.4-hadoop2.jar
com.google.code.findbugs_jsr305-3.0.2.jar
com.google.errorprone_error_prone_annotations-2.1.3.jar
com.google.guava_guava-26.0-jre.jar
com.google.http-client_google-http-client-1.24.1.jar
com.google.http-client_google-http-client-jackson2-1.24.1.jar
com.google.j2objc_j2objc-annotations-1.1.jar
com.google.oauth-client_google-oauth-client-1.24.1.jar
com.google.oauth-client_google-oauth-client-java6-1.24.1.jar
com.thoughtworks.paranamer_paranamer-2.3.jar
commons-codec_commons-codec-1.6.jar
commons-logging_commons-logging-1.1.1.jar
joda-time_joda-time-2.9.3.jar
org.apache.avro_avro-1.7.6.jar
org.apache.commons_commons-compress-1.4.1.jar
org.apache.httpcomponents_httpclient-4.0.1.jar
org.apache.httpcomponents_httpcore-4.0.1.jar
org.checkerframework_checker-qual-2.5.2.jar
org.codehaus.jackson_jackson-core-asl-1.9.13.jar
org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar
org.codehaus.mojo_animal-sniffer-annotations-1.14.jar
org.slf4j_slf4j-api-1.7.5.jar
org.tukaani_xz-1.0.jar
org.xerial.snappy_snappy-java-1.0.5.jar

I don't think I need to add all of these to the spark-submit --jars list. Is there a way I can find out which dependency is missing?

This is the error I am getting when I use the jar file com.github.samelamin_spark-bigquery_2.11-0.2.6.jar instead of the package com.github.samelamin:spark-bigquery_2.11:0.2.6

java.lang.NoClassDefFoundError: com/google/api/client/http/HttpRequestInitializer
  at com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19)
  at com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19)
  at com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105)
  ... 76 elided
Caused by: java.lang.ClassNotFoundException: com.google.api.client.http.HttpRequestInitializer
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)


Thanks






On Tue, 20 Oct 2020 at 20:09, Nicolas Paris <[hidden email]> wrote:
Once you have the jars fetched by --packages in the ~/.ivy2 folder, you can then
add the list to --jars. In this way there is no missing dependency.
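
A small sketch of that idea, assuming the cache was populated by an earlier --packages run (the trailing arguments are elided):

# join every jar Ivy fetched into one comma-separated --jars value
JARS=$(ls ~/.ivy2/jars/*.jar | tr '\n' ',' | sed 's/,$//')
spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars "$JARS" ...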


ayan guha <[hidden email]> writes:

> Hi
>
> One way to think of this is: --packages is better when you have third-party
> dependencies, and --jars is better when you have custom in-house built jars.
>
> On Wed, 21 Oct 2020 at 3:44 am, Mich Talebzadeh <[hidden email]>
> wrote:
>
>> Thanks Sean and Russell. Much appreciated.
>>
>> Just to clarify: recently I had issues with different versions of Google
>> Guava jar files when building an uber jar file (to evict the unwanted ones).
>> This used to work a year and a half ago using Google Dataproc compute
>> engines (which come with Spark preloaded), and I could create an uber jar file.
>>
>> Unfortunately this has become problematic now, so I tried to use spark-submit
>> instead, as follows:
>>
>> ${SPARK_HOME}/bin/spark-submit \
>>                 --master yarn \
>>                 --deploy-mode client \
>>                 --conf spark.executor.memoryOverhead=3000 \
>>                 --class org.apache.spark.repl.Main \
>>                 --name "Spark shell on Yarn" "$@" \
>>                 --driver-class-path /home/hduser/jars/ddhybrid.jar \
>>                 --jars /home/hduser/jars/spark-bigquery-latest.jar, \
>>                        /home/hduser/jars/ddhybrid.jar \
>>                 --packages com.github.samelamin:spark-bigquery_2.11:0.2.6
>>
>> Effectively a tailored spark-shell. However, I do not think there is a
>> mechanism to resolve jar conflicts without building an uber jar file
>> through SBT?
>>
>> Cheers
>>
>>
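
On the jar-conflict question above, there is at least a way to see the competing versions before deciding how to evict them — a hedged sketch, not from the thread:

# list every version of Guava the build pulls in (Maven)
mvn dependency:tree -Dincludes=com.google.guava:guava
# sbt equivalent, assuming the sbt-dependency-graph plugin is installed:
#   sbt dependencyTree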
>>
>> On Tue, 20 Oct 2020 at 16:54, Russell Spitzer <[hidden email]>
>> wrote:
>>
>>> --jars adds only that jar
>>> --packages adds the jar and its dependencies listed in Maven
>>>
>>> --
> Best Regards,
> Ayan Guha


--
nicolas paris

Re: Why spark-submit works with package not with jar

srowen
In reply to this post by Mich Talebzadeh
Rather, let --packages (via Ivy) worry about them, because they tell Ivy what they need.
There's no 100% guarantee that conflicting dependencies are resolved in a way that works in every single case, which you run into sometimes when using incompatible libraries, but yes this is the point of --packages and Ivy.


Re: Why spark-submit works with package not with jar

Mich Talebzadeh
Thanks again all.

Anyway, as Nicolas suggested, I used the trench-war approach to sort this out by just using jars and working out their dependencies in the ~/.ivy2/jars directory using grep -lRi <missing> :)
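
For anyone repeating this, a sketch of that search; the class name comes from the stack trace above:

# scan the Ivy cache for the jar that provides the missing class
CLS='com/google/api/client/http/HttpRequestInitializer'
for j in ~/.ivy2/jars/*.jar; do
  unzip -l "$j" 2>/dev/null | grep -q "$CLS" && echo "$j"
done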


This now works with just using jars (the newly added ones are the com.google.* entries) after resolving the dependencies:


${SPARK_HOME}/bin/spark-submit \
                --master yarn \
                --deploy-mode client \
                --conf spark.executor.memoryOverhead=3000 \
                --class org.apache.spark.repl.Main \
                --name "my own Spark shell on Yarn" "$@" \
                --driver-class-path /home/hduser/jars/ddhybrid.jar \
                --jars /home/hduser/jars/spark-bigquery-latest.jar, \
                       /home/hduser/jars/ddhybrid.jar, \
                       /home/hduser/jars/com.google.http-client_google-http-client-1.24.1.jar, \
                       /home/hduser/jars/com.google.http-client_google-http-client-jackson2-1.24.1.jar, \
                       /home/hduser/jars/com.google.cloud.bigdataoss_util-1.9.4.jar, \
                       /home/hduser/jars/com.google.api-client_google-api-client-1.24.1.jar, \
                       /home/hduser/jars/com.google.oauth-client_google-oauth-client-1.24.1.jar, \
                       /home/hduser/jars/com.google.apis_google-api-services-bigquery-v2-rev398-1.24.1.jar, \
                       /home/hduser/jars/com.google.cloud.bigdataoss_bigquery-connector-0.13.4-hadoop2.jar, \
                       /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar


Compared to using the package itself, as before:


${SPARK_HOME}/bin/spark-submit \
                --master yarn \
                --deploy-mode client \
                --conf spark.executor.memoryOverhead=3000 \
                --class org.apache.spark.repl.Main \
                --name "my own Spark shell on Yarn" "$@" \
                --driver-class-path /home/hduser/jars/ddhybrid.jar \
                --jars /home/hduser/jars/spark-bigquery-latest.jar, \
                       /home/hduser/jars/ddhybrid.jar \
                --packages com.github.samelamin:spark-bigquery_2.11:0.2.6



I think, as Sean suggested, this approach may or may not work (it is a manual process), and if the jars change the whole thing has to be re-evaluated, adding to the complexity.


Cheers 





Re: Why spark-submit works with package not with jar

Wim Van Leuven
Sean, 

The problem with --packages is that in enterprise settings security might not allow the data environment to link to the internet, or even to the internal proxying artefact repository.

Also, weren't uber jars an antipattern? For some reason I don't like them...

Kind regards
-wim





Re: Why spark-submit works with package not with jar

Mich Talebzadeh

Hi Wim,


This is an issue DevOps faces all the time: no access to the internet from behind the company firewall. There is Nexus for this, which manages dependencies with load times usually in seconds. However, only authorised accounts can request artefacts through a service account. I concur it is messy.
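
For what it is worth, --packages can be pointed at such an internal Nexus instead of the public internet — a hedged sketch, with a made-up repository URL:

spark-submit \
    --repositories https://nexus.internal.example.com/repository/maven-public \
    --packages com.github.samelamin:spark-bigquery_2.11:0.2.6 \
    --driver-class-path /home/hduser/jars/ddhybrid.jar \
    --jars /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar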

cheers,







Re: Why spark-submit works with package not with jar

Wim Van Leuven
I like an artefact repo as the proper solution. The problem with environments that haven't yet fully embraced DevOps is that artefact repos are considered development tools, and are often not yet used to promote packages to production, air-gapped if necessary.
-wim



Re: Why spark-submit works with package not with jar

srowen
Yes, it's reasonable to build an uber-jar in development, using Maven/Ivy to resolve dependencies (and of course excluding 'provided' dependencies like Spark), and push that to production. That gives you a static artifact to run that does not depend on external repo access in production.
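
A hedged Maven-flavoured sketch of that flow (the plugin binding, class and artifact names are illustrative):

# build once, with maven-shade-plugin bound to the package phase and
# Spark dependencies scoped "provided" so they stay out of the jar
mvn -DskipTests package
# ship the single shaded artifact to production and run it with no repo access
spark-submit --class com.example.Main target/myapp-1.0-shaded.jar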


Re: Why spark-submit works with package not with jar

Mich Talebzadeh
How about PySpark? What process can that go through so as not to depend on external repo access in production?






Re: Why spark-submit works with package not with jar

Wim Van Leuven
We actually zipped the full conda environments during our build and shipped those.
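
A sketch of that approach with conda-pack; the environment and file names are illustrative:

# freeze the interpreter plus all Python dependencies into one relocatable archive
conda pack -n my_pyspark_env -o environment.tar.gz
# unpack it on the cluster via --archives; no repository access needed at run time
spark-submit \
    --master yarn --deploy-mode cluster \
    --archives environment.tar.gz#environment \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
    job.py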
