Add python library with native code


Add python library with native code

Stone Zhong-2
Hi,

So my PySpark app depends on some Python libraries. At first this was not a problem: I packed all the dependencies into a file, libs.zip, called sc.addPyFile("libs.zip"), and it worked well for a while.

Then I ran into a problem: if any of my libraries has a binary dependency (such as a .so file), this approach does not work, mainly because when a zip file is put on the Python path, Python does not look up needed native libraries (e.g. .so files) inside the archive; this is a Python limitation. So I came up with a workaround:

1) Do not call sc.addPyFile; instead, extract libs.zip into the current directory.
2) When my Python code starts, call sys.path.insert(0, f"{os.getcwd()}/libs") so the extracted directory is on the module search path (rough sketch below).
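
For illustration, this is roughly what that bootstrap looks like (a sketch only: the archive name libs.zip and the target directory libs follow the naming above, and some_native_dep is a hypothetical package that ships .so files):

import os
import sys
import zipfile

def bootstrap_libs(archive="libs.zip", target="libs"):
    # Extract the archive instead of calling sc.addPyFile, so that native
    # .so files end up on disk where the dynamic loader can find them.
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    # Put the extracted directory at the front of the module search path.
    sys.path.insert(0, os.path.join(os.getcwd(), target))

bootstrap_libs()
import some_native_dep  # hypothetical library with binary code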

This workaround works well for me. But then I hit another problem: what if code running on the executors needs a Python library that has binary code? Below is an example:

def do_something(p):
    # imagine this function needs a library that ships native .so files
    ...

rdd = sc.parallelize([
    {"x": 1, "y": 2},
    {"x": 2, "y": 3},
    {"x": 3, "y": 4},
])
a = rdd.map(do_something)

What if the function do_something needs a Python library that has binary code? My current solution is to extract libs.zip onto an NFS (or SMB) share and to call sys.path.insert(0, "share_mount_dir/libs") manually inside do_something, but adding such code to every function looks ugly. Is there a better, more elegant solution?
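
Concretely, my do_something currently looks something like the sketch below (share_mount_dir stands for wherever the share is mounted, and some_native_dep is a placeholder for the actual library):

import sys

def do_something(p):
    # This path setup is repeated in every executor-side function today,
    # which is exactly the part I would like to get rid of.
    sys.path.insert(0, "share_mount_dir/libs")
    import some_native_dep  # hypothetical library with binary code
    return some_native_dep.transform(p)  # placeholder for the real work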

Thanks,
Stone


Re: Add python library with native code

Dark Crusader
Hi Stone,

Have you looked into this article?

I haven't tried it with .so files; however, I did use the approach he recommends to install my other dependencies.
I hope it helps.



Re: Add python library with native code

Stone Zhong-2
Thanks, Dark. I looked at that article; I think it describes approach B. Let me summarize both approaches:
A) Put the libraries on a network share, mount it on each node, and set PYTHONPATH (or sys.path) manually in your code.
B) In your code, install the necessary packages at runtime with pip (e.g. "pip install -r requirements.txt --target <temp_dir>").

I think approach B is very similar to approach A; both have pros and cons. With B, your cluster needs internet access (in my case our cluster runs in an isolated environment for security reasons), though you could set up a private pip server and stage the needed packages there. With A, you need admin permission to mount the network share on every node, which is also a devops burden. (A rough sketch of the approach B bootstrap is below.)
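
For approach B, the runtime bootstrap could look something like this (a sketch only; the requirements file name and the private index URL are placeholders):

import subprocess
import sys
import tempfile

def pip_bootstrap(requirements="requirements.txt",
                  index_url="https://pypi.internal.example/simple"):
    # Install into a throwaway directory and put it on sys.path; pointing
    # --index-url at a private mirror keeps this working on an isolated cluster.
    target = tempfile.mkdtemp()
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "--index-url", index_url,
        "--target", target,
        "-r", requirements,
    ])
    sys.path.insert(0, target)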

I am wondering whether Spark could add a new API to handle this scenario instead of these workarounds, which I suppose would be cleaner and more elegant.

Regards,
Stone




Re: Add python library with native code

Masood Krohy

Not totally sure it's gonna help your use case, but I'd recommend that you consider these too:

  • pex: a library and tool for generating .pex (Python EXecutable) files (rough sketch of one way to use it below)
  • cluster-pack: a library on top of either pex or conda-pack to make your Python code easily available on a cluster
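
For reference, a rough sketch of how a pex archive is typically wired into a PySpark session (deps.pex is a placeholder for an archive built with the pex tool; the same thing is usually done with spark-submit --files plus the PYSPARK_PYTHON environment variable):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.files", "deps.pex")                # ship the archive to the executors
    .config("spark.pyspark.python", "./deps.pex")     # run the executor Python from the pex
    .config("spark.pyspark.driver.python", "python")  # keep the regular driver interpreter
    .getOrCreate()
)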

Masood

__________________

Masood Krohy, Ph.D.
Data Science Advisor|Platform Architect
https://www.analytical.works


Re: Add python library with native code

Stone Zhong-2
Great, thank you Masood, will look into it.

Regards,
Stone



Add python library

Anwar AliKhan
In reply to this post by Dark Crusader
 " > Have you looked into this article?

This is weird! When I came across this post, I was just wondering how I can take one of my projects (OpenAI Gym Taxi-v2 in Python), a project I want to develop further, and run it on Spark.

I want to use Spark's parallelism features and GPU capabilities when working with bigger datasets, with the workers (slaves) doing the sliced-dataset computations on the new 8 GB RAM Raspberry Pi (Linux).

Are there any other documents on the official website, or elsewhere, that show how to do that, preferably with full self-contained examples?





Re: Add python library

Patrick McCarthy-2
I've found that Anaconda encapsulates modules, dependencies, and the like nicely, and you can deploy all the needed .so files by shipping a whole conda environment.
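
A rough sketch of that pattern (assuming a YARN cluster and an environment.tar.gz produced with conda-pack; the same thing is usually expressed as spark-submit --archives environment.tar.gz#environment with PYSPARK_PYTHON pointed at ./environment/bin/python):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Ship the packed conda env; it is unpacked on each node into a
    # directory named "environment".
    .config("spark.yarn.dist.archives", "environment.tar.gz#environment")
    # Use the interpreter (and its bundled .so files) from the unpacked env.
    .config("spark.pyspark.python", "./environment/bin/python")
    .getOrCreate()
)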





--

Patrick McCarthy 

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016