Submitting extra jars on spark applications on yarn with cluster mode


Submitting extra jars on spark applications on yarn with cluster mode

Pedro Cardoso
Hello,

I am submitting a Spark application to YARN using the cluster execution mode.
The application itself depends on a couple of jars. I can successfully submit and run it using the spark-submit --jars option, as seen below:
spark-submit \
--name Yarn-App \
--class <FQN.Class> \
--properties-file conf/yarn.properties \
--jars lib/<first.jar>,lib/<second.jar>,lib/<third.jar> \
<application.jar> > log/yarn-app.txt 2>&1

With the yarn.properties being something like:
# Spark submit config, used in conjunction with YARN cluster mode, so that the spark-submit
# command does not block waiting for application completion.
spark.yarn.submit.waitAppCompletion=false
spark.submit.deployMode=cluster
spark.master=yarn

## General Spark Application properties
spark.driver.cores=2
spark.driver.memory=4G
spark.executor.memory=5G
spark.executor.cores=2
spark.driver.extraJavaOptions=-Xms2G
spark.driver.extraClassPath=<first.jar>:<second.jar>:<third.jar>
spark.executor.heartbeatInterval=30s

spark.shuffle.service.enabled=true
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.maxExecutors=100
spark.dynamicAllocation.initialExecutors=10
spark.kryo.referenceTracking=false
spark.kryoserializer.buffer.max=1G

spark.ui.showConsoleProgress=true
spark.yarn.am.cores=4
spark.yarn.am.memory=10G
spark.yarn.archive=<HDFS path to spark-only jars>
spark.yarn.historyServer.address=<url to history server>

However, I would like to have everything specified in the properties file, to simplify the work of my team and not force them to specify the jars every time.
So my question is: which Spark property replaces the spark-submit --jars parameter, so that I can specify everything in the properties file?
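
In other words, I'm hoping for something along these lines in yarn.properties (a hypothetical sketch; I don't know whether spark.jars is really the right property, which is exactly my question):
spark.jars=lib/<first.jar>,lib/<second.jar>,lib/<third.jar>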

I've tried creating a tar.gz with the contents of the archive specified in spark.yarn.archive plus the three extra jars that I need, uploading that to HDFS, and changing the archive property, but it did not work:
I got class-not-found exceptions for classes that come from the three extra jars.

If it helps, the jars are only required by the driver, not the executors; the executors will simply perform Spark-only operations.

Thank you, and have a good weekend.

--

Pedro Cardoso

Research Engineer

[hidden email]



Re: Submitting extra jars on spark applications on yarn with cluster mode

Artemis User

Assuming you are using Hadoop for your YARN cluster, you can specify the Spark parameters spark.yarn.archive or spark.yarn.jars to point at the jar directory or jar files so that Hadoop can find them by default. See the Spark online docs for details (http://spark.apache.org/docs/latest/running-on-yarn.html#adding-other-jars). For instance:

spark.yarn.archive              hdfs:///spark-3/jars
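
Or, using spark.yarn.jars with a glob, as shown on the same doc page (the path here is just illustrative):

spark.yarn.jars                 hdfs:///spark-3/jars/*.jar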

Please note that you will have to use the Hadoop copy command to copy your jars to HDFS before executing spark-submit (this part isn't clear to a lot of non-Hadoop users). You may also want to upload ALL Spark jars to that directory in advance to speed up the launch process. You may want to contact your Hadoop admin for help.
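
For instance, the upload step is roughly this (directory and jar names are illustrative):

hdfs dfs -mkdir -p /spark-3/jars                                              # create the target dir on HDFS
hdfs dfs -put $SPARK_HOME/jars/*.jar /spark-3/jars/                           # all Spark jars, to speed up launch
hdfs dfs -put lib/<first.jar> lib/<second.jar> lib/<third.jar> /spark-3/jars/ # the three extra app jars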

-- ND


Re: Submitting extra jars on spark applications on yarn with cluster mode

Artemis User

I guess I misread your message. The archive directory should contain only jar files, not tar.gz files...
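
If you do want a single archive for spark.yarn.archive instead of a directory of jars, one approach (a sketch; names are illustrative) is to zip the jars flat, so they sit at the root of the archive, upload the zip, and point the property at it:

zip -j spark-libs.zip $SPARK_HOME/jars/*.jar lib/<first.jar> lib/<second.jar> lib/<third.jar>   # -j stores files flat, at the archive root
hdfs dfs -put spark-libs.zip /spark-3/spark-libs.zip

spark.yarn.archive              hdfs:///spark-3/spark-libs.zip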
