Submitting job with external dependencies to pyspark

Submitting job with external dependencies to pyspark

Tharindu Mathew
Hi,

Newbie to pyspark/spark here.

I'm trying to submit a job to pyspark with an external dependency (Spark DL in this case). The local environment has it installed, but pyspark does not see it. How do I correctly start pyspark so that it sees this dependency?

Using Spark 2.3.0 in a Cloudera setup.

--
Regards,
Tharindu Mathew

Re: Submitting job with external dependencies to pyspark

Chris Teoh

On Tue, 28 Jan 2020, 9:46 am Tharindu Mathew, <[hidden email]> wrote:
Hi,

Newbie to pyspark/spark here.

I'm trying to submit a job to pyspark with an external dependency (Spark DL in this case). The local environment has it installed, but pyspark does not see it. How do I correctly start pyspark so that it sees this dependency?

Using Spark 2.3.0 in a Cloudera setup.

--
Regards,
Tharindu Mathew

Re: Submitting job with external dependencies to pyspark

Tharindu Mathew
That was really helpful. Thanks! I actually solved my problem by creating a venv and using the venv flags. Wondering now how to submit the data as an archive? Any idea?
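
For anyone searching the archives later, a minimal sketch of one way to ship a venv on a YARN cluster (not necessarily the exact flags used here; file names and paths are placeholders): pack the venv into an archive, submit it with --archives, and point PYSPARK_PYTHON at the unpacked copy.

    # Pack the virtualenv into an archive. venv-pack is one tool that does this
    # (conda-pack is the conda equivalent); "pyspark_venv.tar.gz" is an example name.
    venv-pack -o pyspark_venv.tar.gz

    # Ship the archive with the job. Spark unpacks it on each executor under the
    # alias given after '#', so PYSPARK_PYTHON can point into it.
    export PYSPARK_DRIVER_PYTHON=python
    export PYSPARK_PYTHON=./environment/bin/python
    spark-submit \
      --master yarn \
      --archives pyspark_venv.tar.gz#environment \
      my_job.py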


Re: Submitting job with external dependencies to pyspark

Chris Teoh
Usually this isn't done, as the data is meant to be on shared/distributed storage, e.g. HDFS, S3, etc.

Spark should then read this data into a DataFrame, and your code logic applies to the DataFrame in a distributed manner.
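
Something like this, roughly (paths and column names below are just examples, not from this thread):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("example-job").getOrCreate()

    # Read the input directly from distributed storage (HDFS here); there is
    # no need to ship the data with the job itself.
    df = spark.read.parquet("hdfs:///data/my_dataset")

    # Transformations are expressed against the DataFrame and executed in a
    # distributed manner across the executors.
    result = df.filter(df["label"] == 1).groupBy("category").count()

    # Write the output back to distributed storage.
    result.write.parquet("hdfs:///data/my_output")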



--
Chris