Path of jars added to a Spark Job - spark-submit // // Override jars in spark submit



Dominique De Vito
Hi,

I am using Spark 2.1 (BTW) on YARN.

I am trying to upload JARs to a YARN cluster, and to use them to replace the on-site (already in place) JARs.

I am trying to do so through spark-submit.

One helpful answer, https://stackoverflow.com/questions/37132559/add-jars-to-a-spark-job-spark-submit/37348234, is the following:

spark-submit --jars additional1.jar,additional2.jar \
  --driver-class-path additional1.jar:additional2.jar \
  --conf spark.executor.extraClassPath=additional1.jar:additional2.jar \
  --class MyClass main-application.jar

So, I understand the following:

  • "--jars" is for uploading jar on each node
  • "--driver-class-path" is for using uploaded jar for the driver.
  • "--conf spark.executor.extraClassPath" is for using uploaded jar for executors.

While I know the file paths to pass to "--jars" within a spark-submit command, what will be the file paths of the uploaded JARs, to be used in "--driver-class-path" for example?

The doc says: "JARs and files are copied to the working directory for each SparkContext on the executor nodes"

Fine, but for the following command, what should I put instead of XXX and YYY?

spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
  --driver-class-path XXX:YYY \
  --conf spark.executor.extraClassPath=XXX:YYY \
  --class MyClass main-application.jar

When using spark-submit, how can I reference the "working directory for the SparkContext" to form the XXX and YYY file paths?

Thanks.

Dominique

PS: I have tried

spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
  --driver-class-path some1.jar:some2.jar \
  --conf spark.executor.extraClassPath=some1.jar:some2.jar  \
  --class MyClass main-application.jar

No success (unless I made a mistake).

And I have also tried:

spark-submit --jars /a/b/some1.jar,/a/b/c/some2.jar \
   --driver-class-path ./some1.jar:./some2.jar \
   --conf spark.executor.extraClassPath=./some1.jar:./some2.jar \
   --class MyClass main-application.jar

No success either.


Re: Path of jars added to a Spark Job - spark-submit // // Override jars in spark submit

Russell Spitzer
--driver-class-path does not move jars, so it is dependent on your Spark resource manager (master). It is interpreted literally, so if your files do not exist in the location you provide, relative to where the driver is run, they will not be placed on the classpath.

Since the driver is responsible for moving jars specified in --jars, you cannot expect a jar specified by --jars to be on the driver-class-path: the driver has already started, and its classpath is already set, before any jars are moved.

Some distributions may change this behavior, but this is the gist of it.
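
Roughly, and as an untested sketch (reusing the jar names from your example, so adjust the paths to your setup): in cluster mode the driver runs inside a YARN container, and the files listed in --jars are localized into that container's working directory before the driver JVM starts, so the bare file names can go on spark.driver.extraClassPath; in client mode the driver runs on the machine where spark-submit is launched, so --driver-class-path has to point at local paths that already exist there.

# Cluster mode (sketch): --jars files are localized into the driver container's
# working directory, so bare names can resolve on the extra classpath.
spark-submit --master yarn --deploy-mode cluster \
  --jars /a/b/some1.jar,/a/b/c/some2.jar \
  --conf spark.driver.extraClassPath=some1.jar:some2.jar \
  --conf spark.executor.extraClassPath=some1.jar:some2.jar \
  --class MyClass main-application.jar

# Client mode (sketch): the driver starts on the submitting machine before
# anything is uploaded, so give it the local absolute paths.
spark-submit --master yarn --deploy-mode client \
  --jars /a/b/some1.jar,/a/b/c/some2.jar \
  --driver-class-path /a/b/some1.jar:/a/b/c/some2.jar \
  --conf spark.executor.extraClassPath=some1.jar:some2.jar \
  --class MyClass main-application.jar

Since the extraClassPath entries are prepended to the default classpath, this is also the usual way to have such jars take precedence over ones already installed on the cluster.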



Re: Path of jars added to a Spark Job - spark-submit // // Override jars in spark submit

Mich Talebzadeh
As I understand it, Spark expects the jar files to be available on all nodes or, if applicable, in an HDFS directory.

Putting Spark Jar files on HDFS

In YARN mode, it is important that the Spark jar files are available throughout the Spark cluster. I have spent a fair bit of time on this and I recommend that you follow this procedure to make sure that the spark-submit job runs OK. Use the spark.yarn.archive configuration option and set it to the location of an archive (which you create on HDFS) containing all the JARs from the $SPARK_HOME/jars/ folder, at the root level of the archive. For example:

1) Create the archive:
   jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
2) Create a directory on HDFS for the jars, accessible to the application:
   hdfs dfs -mkdir /jars
3) Upload the archive to HDFS:
   hdfs dfs -put spark-libs.jar /jars
4) For a large cluster, increase the replication count of the Spark archive
   so that you reduce the number of times a NodeManager does a remote copy
   (make the number of replicas proportional to the total number of NodeManagers):
   hdfs dfs -setrep -w 10 hdfs:///jars/spark-libs.jar
5) In the $SPARK_HOME/conf/spark-defaults.conf file, set spark.yarn.archive to
   hdfs://rhes75:9000/jars/spark-libs.jar, similar to below:
   spark.yarn.archive=hdfs://rhes75:9000/jars/spark-libs.jar

Every node of Spark needs to have the same $SPARK_HOME/conf/spark-defaults.conf file
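
If editing spark-defaults.conf on every node is not convenient, the same setting can also be passed per job on the command line. A minimal sketch, reusing the archive location above (adjust the host and port to your HDFS namenode):

spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.archive=hdfs://rhes75:9000/jars/spark-libs.jar \
  --class MyClass main-application.jar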

HTH






Re: Path of jars added to a Spark Job - spark-submit // // Override jars in spark submit

Dominique De Vito
In reply to this post by Russell Spitzer
Thanks Russell

> Since the driver is responsible for moving jars specified in --jars, you cannot expect a jar specified by --jars to be on the driver-class-path: the driver has already started, and its classpath is already set, before any jars are moved.

Your point is interesting; however, I see some discrepancy with the Spark doc, which says:

"When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. URLs supplied after --jars must be separated by commas. That list is included on the driver and executor classpaths."

The most interesting part here (for the discussion) is "That list [from --jars] is included on the driver and executor classpaths.".

That seems to contradict your sentence (as you state that a jar specified by --jars cannot be on the driver classpath).

... hmm, I am still thinking about how to reconcile the two.
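
Maybe one quick way to check (an untested sketch, reusing the paths from my example): spark-submit has a --verbose flag that prints the parsed arguments, the Spark properties actually used, and the classpath elements it computes, so I could see where the --jars entries end up:

# Print parsed arguments, effective Spark properties and classpath elements
spark-submit --verbose \
  --jars /a/b/some1.jar,/a/b/c/some2.jar \
  --class MyClass main-application.jar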

Thanks anyway

Dominique





Re: Path of jars added to a Spark Job - spark-submit // // Override jars in spark submit

Dominique De Vito
In reply to this post by Mich Talebzadeh
Thanks Mich

To be sure, are you really saying that, using the option "spark.yarn.archive", YOU have been able to OVERRIDE the installed Spark JARs with the JARs given through that option?

Nothing more than "spark.yarn.archive"?

Thanks

Dominique



 
