Python Dependencies Issue on EMR


Jonas Shomorony

Hey everyone,


I am currently trying to run a Python Spark job (using YARN client mode) that uses multiple libraries, on a Spark cluster on Amazon EMR. To do that, I create a dependencies.zip file that contains all of the libraries (installed through pip) that the job needs to run successfully, such as pandas, scipy, tqdm, psycopg2, etc. The dependencies.zip file sits inside a directory (let’s call it “project”) that also contains all the code for my Spark job. I then zip up everything within project (including dependencies.zip) into project.zip. Then, I call spark-submit on the master node in my EMR cluster as follows:


`spark-submit --packages … --py-files project.zip --jars ... run_command.py`


Within “run_command.py” I add dependencies.zip as follows:

`self.spark.sparkContext.addPyFile("dependencies.zip")`


run_command.py then uses other files within project.zip to complete the Spark job, and within those files, I import various libraries (found in dependencies.zip).


I am running into a strange issue where all of the libraries are imported correctly (with no problems) with the exception of scipy and pandas. 


For scipy I get the following error:


`File "/mnt/tmp/pip-install-79wp6w/scipy/scipy/__init__.py", line 119, in <module>
  File "/mnt/tmp/pip-install-79wp6w/scipy/scipy/_lib/_ccallback.py", line 1, in <module>
ImportError: cannot import name _ccallback_c`


And for pandas I get this error message:


`File "/mnt/tmp/pip-install-79wp6w/pandas/pandas/__init__.py", line 35, in <module>
ImportError: C extension: No module named tslib not built. If you want to import pandas from the source directory, you may need to run 'python setup.py build_ext --inplace --force' to build the C extensions first.`


When I comment out the imports for these two libraries (and their use from within the code) everything works fine. 


Surprisingly, when I run the application locally (on the master node) without passing in dependencies.zip, it picks up and resolves the libraries from site-packages correctly and runs to completion. dependencies.zip is created by zipping the contents of site-packages.
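
To see where each library is actually being loaded from (site-packages versus the shipped zip), a small probe along these lines could be run on the driver and on the executors. This is only a sketch and not part of the original job: `probe_import` and `probe_all` are hypothetical helper names, and the commented usage assumes an existing SparkContext called `sc`.

```python
import importlib


def probe_import(module_name):
    """Try to import a module; report success and where it was loaded from."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Keep the error text so the failing extension is visible in the result.
        return (module_name, False, str(exc))
    # Built-in modules have no __file__, so fall back to a label.
    return (module_name, True, getattr(module, "__file__", "<built-in>"))


def probe_all(module_names):
    """Probe several modules and return one result tuple per module."""
    return [probe_import(name) for name in module_names]


# Hypothetical usage on a cluster, assuming a SparkContext named `sc`:
#   sc.parallelize(range(4), 4) \
#     .flatMap(lambda _: probe_all(["scipy", "pandas", "tqdm"])) \
#     .distinct().collect()
```

Comparing the reported paths for a library that works (such as tqdm) with the errors for scipy and pandas may help narrow down whether the import is being attempted from inside the zip.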


Does anyone have any ideas as to what may be happening here? I would really appreciate it.


pip version: 18.0

spark version: 2.3.1

python version: 2.7


Thank you,


Jonas



Re: Python Dependencies Issue on EMR

Patrick McCarthy-2
You didn't say how you're zipping the dependencies, but I'm guessing you either include .egg files or zipped up a virtualenv. In either case, the extra C stuff that scipy and pandas rely upon doesn't get included.
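
One way to check this guess is to list the compiled extension modules inside the archive. Note that, as a general Python limitation, `zipimport` cannot load compiled extension modules (`.so` or `.pyd` files) from inside a zip, so even when those files are present in dependencies.zip they cannot be imported directly from it. The following is a minimal sketch; `find_extension_modules` is a hypothetical helper name, not something from the original job.

```python
import zipfile

# File suffixes of compiled extension modules (Linux/macOS and Windows).
EXTENSION_SUFFIXES = (".so", ".pyd")


def find_extension_modules(zip_path):
    """Return the archive members that are compiled extension modules."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.endswith(EXTENSION_SUFFIXES)
        )


# Hypothetical usage:
#   for member in find_extension_modules("dependencies.zip"):
#       print(member)
```

Any package that shows up in this list is a candidate for the kind of import failure seen with scipy and pandas, while pure-Python packages can be imported from the zip as usual.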

An approach like this solved the last problem I had that seemed like this - https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
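
The general shape of that approach, sketched in Python under some assumptions: a packed conda environment is shipped with `--archives`, YARN unpacks it on each node under the alias after the `#`, and `PYSPARK_PYTHON` points the executors at the interpreter inside it. The archive name, the alias, and the helper `build_submit` are all hypothetical, the interpreter path depends on how the archive was laid out when it was packed, and the elided `--packages` and `--jars` arguments from the original command are left out here.

```python
import os


def build_submit(archive, alias, script):
    """Build a spark-submit command and environment for YARN client mode."""
    env = dict(os.environ)
    # Executors use the interpreter inside the unpacked archive.
    # Assumes bin/python sits at the top level of the archive.
    env["PYSPARK_PYTHON"] = "./{}/bin/python".format(alias)
    # In client mode the driver runs on the submitting node,
    # so it keeps using the locally installed interpreter.
    env["PYSPARK_DRIVER_PYTHON"] = "python"
    command = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "client",
        # YARN unpacks the archive on each node under the name after '#'.
        "--archives", "{}#{}".format(archive, alias),
        script,
    ]
    return command, env


# Hypothetical usage:
#   import subprocess
#   command, env = build_submit("conda_env.tar.gz", "environment", "run_command.py")
#   subprocess.check_call(command, env=env)
```

Because the environment is unpacked onto the nodes' filesystems rather than imported from a zip, compiled extensions such as those in scipy and pandas can be loaded normally.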

On Thu, Sep 13, 2018 at 10:08 PM, Jonas Shomorony <[hidden email]> wrote:





Re: Python Dependencies Issue on EMR

Jonas Shomorony
Thanks Patrick. Using a conda virtual environment did help with libraries that required the extra C stuff.

Jonas

On Fri, Sep 14, 2018 at 8:02 AM Patrick McCarthy <[hidden email]> wrote: