Issue while installing dependencies Python Spark


Sachit Murarka
Hi Users,

I have a wheel file; while creating it, I declared the dependencies in its setup.py.
I now have two virtual envs: one that already existed, and another that I created just now.

I have switched to the new virtual env, and I want Spark to download the dependencies when I run spark-submit with the wheel.

Could you please help me with this?

It is not downloading the dependencies; instead it is pointing to the older virtual env and proceeding with the execution of the Spark job.

Please note that I have also tried setting the environment variables.
I have also tried the following options in spark-submit:

--conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=requirements.txt --conf spark.pyspark.python=/path/to/venv/bin/python3 --conf spark.pyspark.driver.python=/path/to/venv/bin/python3

This did not help either.

Kind Regards,
Sachit Murarka

Re: Issue while installing dependencies Python Spark

Patrick McCarthy-2
I'm not very familiar with the environments on cloud clusters, but in general I'd be reluctant to lean on setuptools or other Python install mechanisms. In the worst case you might find that /usr/bin/pip doesn't have permission to install new packages, or, even if it does, that a package requires something you can't change, like a libc dependency.

Perhaps you can install the .whl and its dependencies into the virtualenv on a local machine, and then, *after* the install process, package that venv?

If possible, I like conda over a vanilla venv for this approach, because it will contain all the non-Python dependencies (like libc) if they're needed.
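
For what it's worth, a minimal sketch of that packaging step, assuming the conda-pack library is available (the env name and output path here are just illustrative):

# Sketch only: package an already-built conda env for shipping to the cluster.
# Assumes conda-pack is installed; 'py37minimal' is a hypothetical env name.
import conda_pack

conda_pack.pack(
    name='py37minimal',            # env that already has the .whl and its deps installed
    output='py37minimal.tar.gz',   # archive to upload to HDFS (zipping works too)
    force=True,                    # overwrite an existing archive at that path
)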


Another thing - I think there are several ways to do this, but I've had the most success including the .zip containing my environment in `spark.yarn.dist.archives` and then using a relative path:

import os
from pyspark.sql import SparkSession

# Point the workers at the Python interpreter inside the unpacked archive;
# 'py37minimal_env' is the alias given after the '#' in the archive URI below.
os.environ['PYSPARK_PYTHON'] = './py37minimal_env/py37minimal/bin/python'

dist_archives = 'hdfs:///user/pmccarthy/conda/py37minimal.zip#py37minimal_env'

spark = (SparkSession.builder
         # ... any other config ...
         .config('spark.yarn.dist.archives', dist_archives)
         .getOrCreate())




--

Patrick McCarthy 

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016


Re: Issue while installing dependencies Python Spark

Artemis User
In reply to this post by Sachit Murarka

A wheel is used for package management and for setting up your virtual environment; it is not used as a library package. To run spark-submit in a virtual env, use the --py-files option instead. Usage:

--py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.

In other words, you can't run spark-submit in a virtual environment like a regular Python program, since it is NOT a regular Python script. But you can package your Python Spark project (including all dependency libs) as a zip or egg file and make it available to spark-submit. Please note that spark-submit plays the role of a driver: it is only responsible for submitting jobs to a Spark master. The master distributes the job content, including all dependency libs, to the individual worker nodes where the job is executed. Packaging in zip or egg format makes that distribution easier.
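
For illustration, a minimal sketch of the programmatic counterparts of --py-files (the archive path is hypothetical): the spark.submit.pyFiles config, or SparkContext.addPyFile once the session exists.

from pyspark.sql import SparkSession

# Hypothetical archive containing the project and its pure-Python dependencies.
deps = 'hdfs:///user/me/project_deps.zip'

spark = (SparkSession.builder
         .appName('my_job')
         # programmatic equivalent of spark-submit --py-files
         .config('spark.submit.pyFiles', deps)
         .getOrCreate())

# Or ship an extra .zip/.egg/.py to the executors after the session exists:
spark.sparkContext.addPyFile(deps)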

-- ND


Re: Issue while installing dependencies Python Spark

Sachit Murarka
In reply to this post by Patrick McCarthy-2
Hi Patrick/Users,

I am exploring the wheel-file form of packaging for this, as it seems simple.

However, I am facing another issue: I am using pandas, which needs NumPy, and NumPy is giving an error:


ImportError: Unable to import required dependencies:
numpy:

IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.

We have compiled some common reasons and troubleshooting tips at:

    https://numpy.org/devdocs/user/troubleshooting-importerror.html

Please note and check the following:

  * The Python version is: Python3.7 from "/usr/bin/python3"
  * The NumPy version is: "1.19.4"

and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.

Original error was: No module named 'numpy.core._multiarray_umath'



Kind Regards,
Sachit Murarka



Re: Issue while installing dependencies Python Spark

Patrick McCarthy-2
At the risk of repeating myself, this is what I was hoping to avoid when I suggested deploying a full, zipped, conda venv. 

What is your motivation for running an install process on the nodes and risking the process failing, instead of pushing a validated environment artifact and not having that risk? In either case you move about the same number of bytes around.




--

Patrick McCarthy 

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016