PyCharm, Running spark-submit calling jars and a package at run time


Mich Talebzadeh
Hi,

I have a module in PyCharm which reads data stored in a BigQuery table and does plotting.

At the command line in the terminal I need to add the jar file and the package to make it work.

(venv) C:\Users\admin\PycharmProjects\pythonProject2\DS\src>spark-submit --jars ..\lib\spark-bigquery-with-dependencies_2.12-0.18.0.jar analyze_house_prices_GCP.py

This works, but the problem is that the imports in the module are not picked up. Example:


import sparkstuff as s


This is picked up when run within PyCharm itself, but not at the command line!


(venv) C:\Users\admin\PycharmProjects\pythonProject2\DS\src>spark-submit --jars ..\lib\spark-bigquery-with-dependencies_2.12-0.18.0.jar analyze_house_prices_GCP.py
Traceback (most recent call last):
  File "C:/Users/admin/PycharmProjects/pythonProject2/DS/src/analyze_house_prices_GCP.py", line 8, in <module>
    import sparkstuff as s
ModuleNotFoundError: No module named 'sparkstuff'


The easiest option would be to run all of this within PyCharm itself, invoking the jar file and package at runtime.

Otherwise I can run it at the command line, provided I can resolve the imports. I would appreciate any workaround for this.

Thanks


LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 


Re: PyCharm, Running spark-submit calling jars and a package at run time

Riccardo Ferrari
You need to provide your python dependencies as well. See http://spark.apache.org/docs/latest/submitting-applications.html, look for --py-files

HTH

On Fri, Jan 8, 2021 at 3:13 PM Mich Talebzadeh <[hidden email]> wrote:


Re: PyCharm, Running spark-submit calling jars and a package at run time

Mich Talebzadeh
Thanks Riccardo.

I am well aware of that form of submission.

However, my question relates to doing submission within PyCharm itself.

This is what I do at the PyCharm terminal to invoke the Python module:

spark-submit --jars ..\lib\spark-bigquery-with-dependencies_2.12-0.18.0.jar \
 --packages com.github.samelamin:spark-bigquery_2.11:0.2.6 analyze_house_prices_GCP.py

However, when run from the terminal it does not pick up the import dependencies in the code:

Traceback (most recent call last):
  File "C:/Users/admin/PycharmProjects/pythonProject2/DS/src/analyze_house_prices_GCP.py", line 8, in <module>
    import sparkstuff as s
ModuleNotFoundError: No module named 'sparkstuff'

The Python code is attached; it is pretty simple.

Thanks




 



On Fri, 8 Jan 2021 at 15:51, Riccardo Ferrari <[hidden email]> wrote:


Attachment: analyze_house_prices_GCP.py (4K)

Re: PyCharm, Running spark-submit calling jars and a package at run time

srowen
I don't see anywhere that you provide 'sparkstuff'. How would the Spark app have this code otherwise?

On Fri, Jan 8, 2021 at 10:20 AM Mich Talebzadeh <[hidden email]> wrote:


Re: PyCharm, Running spark-submit calling jars and a package at run time

Riccardo Ferrari
I think Spark checks the PYTHONPATH environment variable; you need to provide that.
Of course, that works in local mode only.
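The mechanism Riccardo describes can be demonstrated without Spark at all: entries on PYTHONPATH end up on the driver's sys.path, which is what lets `import sparkstuff` resolve in local mode. A self-contained sketch (the directory and module contents are made up):

```python
# Illustration of the PYTHONPATH mechanism: a directory on sys.path is enough
# for "import sparkstuff" to resolve. Names and contents are invented.
import importlib
import os
import sys
import tempfile

pkg_dir = tempfile.mkdtemp()  # stand-in for a local "packages" directory
with open(os.path.join(pkg_dir, "sparkstuff.py"), "w") as f:
    f.write("GREETING = 'hello from sparkstuff'\n")

sys.path.insert(0, pkg_dir)   # this is what a PYTHONPATH entry does
s = importlib.import_module("sparkstuff")
print(s.GREETING)             # -> hello from sparkstuff
```

Note that spark-submit launches its own Python process, so a PYTHONPATH visible in the PyCharm terminal only helps if it is exported in that same environment, and on a cluster --py-files is still needed for the executors.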

On Fri, Jan 8, 2021, 5:28 PM Sean Owen <[hidden email]> wrote:


Re: PyCharm, Running spark-submit calling jars and a package at run time

Mich Talebzadeh
Hi Sean,


sparkstuff.py is under packages/sparkutils/sparkstuff.py as shown below.


[screenshot: PyCharm project tree showing packages/sparkutils/sparkstuff.py]


So within PyCharm it is picked up OK. However, at the terminal level it is not.

This is a snapshot of PyCharm. The module I am trying to run is called analyze_house_prices_GCP.py, under the src package. At the same level as src I have the utility package called packages, which holds all the Spark-related helpers. These are in sparkstuff.py:


from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext
#import findspark
#findspark.init()

def spark_session(appName):
    return SparkSession.builder \
        .appName(appName) \
        .enableHiveSupport() \
        .getOrCreate()

def sparkcontext():
    return SparkContext.getOrCreate()

def hivecontext():
    return HiveContext(sparkcontext())

def spark_session_local(appName):
    return SparkSession.builder \
        .master('local[1]') \
        .appName(appName) \
        .enableHiveSupport() \
        .getOrCreate()


Thanks



 



On Fri, 8 Jan 2021 at 16:27, Sean Owen <[hidden email]> wrote:




Re: PyCharm, Running spark-submit calling jars and a package at run time

Mich Talebzadeh
Hi Riccardo

These are the environment variables at runtime:

PYTHONUNBUFFERED=1;PYTHONPATH=C:\Users\admin\PycharmProjects\packages\;C:\Users\admin\PycharmProjects\pythonProject2\DS\;C:\Users\admin\PycharmProjects\pythonProject2\DS\conf\;C:\Users\admin\PycharmProjects\pythonProject2\DS\lib\;C:\Users\admin\PycharmProjects\pythonProject2\DS\src

This is the configuration set up for analyze_house_prices_GCP

[screenshot: PyCharm run configuration for analyze_house_prices_GCP]




So, as in Linux, I created a Windows environment variable, and on the PyCharm terminal I can see it:



(venv) C:\Users\admin\PycharmProjects\pythonProject2\DS\src>echo %PYTHONPATH%
PYTHONPATH=C:\Users\admin\PycharmProjects\packages\;C:\Users\admin\PycharmProjects\pythonProject2\DS\;C:\Users\admin\PycharmProjects\pythonProject2\DS\conf\;C:\Users\admin\PycharmProjects\pythonProject2\DS\lib\;C:\Users\admin\PycharmProjects\pythonProject2\DS\src


It picks up sparkstuff.py:


(venv) C:\Users\admin\PycharmProjects\pythonProject2\DS\src>where sparkstuff.py

C:\Users\admin\PycharmProjects\packages\sparkutils\sparkstuff.py


But spark-submit does not pick it up within the code:

(venv) C:\Users\admin\PycharmProjects\pythonProject2\DS\src>spark-submit --jars ..\spark-bigquery-with-dependencies_2.12-0.18.0.jar analyze_house_prices_GCP.py
Traceback (most recent call last):
  File "C:/Users/admin/PycharmProjects/pythonProject2/DS/src/analyze_house_prices_GCP.py", line 8, in <module>
    import sparkstuff as s
ModuleNotFoundError: No module named 'sparkutils'

thanks



 



On Fri, 8 Jan 2021 at 16:38, Riccardo Ferrari <[hidden email]> wrote:




Re: PyCharm, Running spark-submit calling jars and a package at run time

srowen
This isn't going to help when submitting to a remote cluster, though. You need to explicitly include dependencies in your submit.

On Fri, Jan 8, 2021 at 11:15 AM Mich Talebzadeh <[hidden email]> wrote:




Re: PyCharm, Running spark-submit calling jars and a package at run time

Mich Talebzadeh

Just to clarify, are you referring to module dependencies in PySpark?


With Scala I can create an uber jar file, built with Maven or sbt, inclusive of all the bits and pieces; it works in a cluster and can be submitted to spark-submit as a single jar file.


What alternative would you suggest for PySpark, a zip file?
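For what it's worth, a zip is the usual PySpark analogue of the uber jar: Python can import packages directly from a zip archive on sys.path, and --py-files relies on exactly that. A small self-contained sketch (package name and attribute invented for illustration):

```python
# Demonstrates zip imports, the mechanism behind spark-submit --py-files.
# The package name and contents are illustrative.
import importlib
import os
import sys
import tempfile
import zipfile

zip_path = os.path.join(tempfile.mkdtemp(), "deps.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("sparkutils/__init__.py", "")
    zf.writestr("sparkutils/sparkstuff.py", "ANSWER = 42\n")

sys.path.insert(0, zip_path)  # spark-submit does the equivalent for --py-files
mod = importlib.import_module("sparkutils.sparkstuff")
print(mod.ANSWER)             # -> 42
```

So a plain zip of the packages directory, passed with --py-files, plays roughly the role the uber jar plays on the Scala side; compiled extension modules are the main thing that does not survive this route.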


cheers,





On Fri, 8 Jan 2021 at 17:18, Sean Owen <[hidden email]> wrote:




Re: PyCharm, Running spark-submit calling jars and a package at run time

Mich Talebzadeh

Well, I decided to have a go at this.

As I understand it, PyCharm is the glue that holds all these modules together and resolves dependencies internally. This means that all imports in modules can be taken care of.

When one runs the module alone from the command line in the virtual environment at the terminal, dependencies are not resolved, so things don't work and imports from within modules throw errors. In short, spark-submit has no recollection of the dependencies (please correct me if this is wrong).

After thinking about it (this is purely a runtime issue with Spark), I looked around under the %SPARK_HOME% directory (set up in the Windows user environment), as below:

[screenshot: contents of the %SPARK_HOME% directory]


This is the Spark version that PyCharm/PySpark uses. Go to the jars directory in C:\spark-3.0.1-bin-hadoop2.7\jars and put your needed jar files there. 


[screenshot: the BigQuery jar copied into the Spark jars directory]



Then go back to PyCharm and run the module inside PyCharm itself the usual way (right-click the module). It picks up that jar file and uses it to read the BigQuery table from PyCharm with Spark.


spark.conf.set("GcpJsonKeyFile", v.jsonKeyFile)
spark.conf.set("BigQueryProjectId", v.projectId)
spark.conf.set("BigQueryDatasetLocation", v.datasetLocation)
spark.conf.set("google.cloud.auth.service.account.enable", "true")
spark.conf.set("fs.gs.project.id", v.projectId)
spark.conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark.conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
spark.conf.set("temporaryGcsBucket", v.tmp_bucket)

sqltext = ""
from pyspark.sql.window import Window

# read data from the BigQuery table in the staging area
print("\nreading data from " + v.projectId + ":" + v.inputTable)
source_df = spark.read. \
    format("bigquery"). \
    option("dataset", v.sourceDataset). \
    option("table", v.sourceTable). \
    load()

This is the output.


[screenshot: output of the BigQuery read]


The next step for me is to figure out how to run a package at runtime, i.e. the equivalent of spark-submit --packages &lt;PACKAGE&gt;.
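One avenue worth trying for that step, without leaving PyCharm, is the PYSPARK_SUBMIT_ARGS environment variable, which PySpark reads when a SparkSession is created from a plain Python process. The sketch below only composes the variable using this thread's coordinates; treat the exact incantation (and whether the package versions are mutually compatible) as an assumption to verify:

```python
# Sketch: pass --jars/--packages to PySpark launched from inside an IDE via
# PYSPARK_SUBMIT_ARGS. It must be set before the SparkSession is created.
import os

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars ..\\lib\\spark-bigquery-with-dependencies_2.12-0.18.0.jar "
    "--packages com.github.samelamin:spark-bigquery_2.11:0.2.6 "
    "pyspark-shell"  # pyspark expects this terminator when parsing the variable
)

# Only after that:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("analyze_house_prices_GCP").getOrCreate()
```

This keeps the jar management out of %SPARK_HOME%\jars, so the Spark installation stays untouched between projects.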


HTH,


Mich


 



On Fri, 8 Jan 2021 at 17:32, Mich Talebzadeh <[hidden email]> wrote:
