Loading Python libraries into Spark

mrm
Hi,

I am new to Spark (and almost new to Python!). How can I download and install a Python library on my cluster so that I can just import it later?

Any help would be much appreciated.

Thanks!


Re: Loading Python libraries into Spark

Andrei
For third-party libraries the simplest way is to use Puppet [1], Chef [2], or a similar automation tool to install the packages (either from pip [3] or from your distribution's repository). It's easy because if you manage your cluster's software you are most probably already using one of these tools, so you only need to write one more recipe to keep your Python packages healthy.

For your own libraries it may be inconvenient to publish them to PyPI just to deploy them to your cluster. Instead, you can "attach" your Python lib to the SparkContext via the "pyFiles" constructor option or the "addPyFile" method. For example, to attach a single Python file:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("myapp")       # your application's configuration
    sc = SparkContext(conf=conf)
    sc.addPyFile("/path/to/yourmodule.py")       # ship this file to every worker
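
Alternatively, you can pass the same file through the SparkContext constructor's "pyFiles" argument when the context is created (a minimal sketch; the path and app name are placeholders):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("myapp")
    sc = SparkContext(conf=conf, pyFiles=["/path/to/yourmodule.py"])   # attached at startup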

And for entire packages (in Python, a package is any directory with an "__init__.py" and possibly several more "*.py" files) you can use a trick and pack them into a zip archive. I used the following code for my library:

    import os
    import uuid
    import zipfile

    def ziplib():
        libpath = os.path.dirname(__file__)                       # this should point to your package's directory
        zippath = '/tmp/mylib-' + uuid.uuid4().hex[:6] + '.zip'   # some random filename in a writable directory
        zf = zipfile.PyZipFile(zippath, mode='w')
        try:
            zf.debug = 3                                           # make it verbose, good for debugging
            zf.writepy(libpath)                                    # recursively add the package's *.py files
            return zippath                                         # return the path to the generated zip archive
        finally:
            zf.close()

    ...
    zip_path = ziplib()                                            # generate a zip archive containing your lib
    sc.addPyFile(zip_path)                                         # add the entire archive to the SparkContext
    ...
    os.remove(zip_path)                                            # remove the temporary file, preferably in a "finally" clause


Python has a nice feature: it can import code not only from modules (plain "*.py" files) and packages, but also from a number of other formats, including zip archives. So when your distributed code says something like "import mylib", Python finds the zip archive attached to the SparkContext and imports the required modules from it.
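
For instance, once the archive has been added, distributed code can import the package by its usual name (a minimal sketch; the package "mylib" and its "clean_record" function are hypothetical names):

    def clean(line):
        import mylib                                   # resolved from the attached zip on each worker
        return mylib.clean_record(line)

    rdd = sc.textFile("hdfs:///path/to/input.txt")
    print(rdd.map(clean).take(5))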

HTH,
Andrei

[1]: http://puppetlabs.com/
[2]: http://www.getchef.com/chef/
[3]: https://pypi.python.org/pypi/pip (if you program in Python and still don't use pip, you should definitely give it a try)




Re: Loading Python libraries into Spark

mrm
Hi Andrei,

Thank you for your help! Just to make sure I understand: when I run the command sc.addPyFile("/path/to/yourmodule.py"), I need to already be logged into the master node and have my Python files there somewhere, is that correct?

Re: Loading Python libraries into Spark

Andrei
In my answer I assumed you run your program with the "pyspark" command (e.g. "pyspark mymainscript.py"; pyspark should be on your path). In that case the workflow is as follows:

1. You create a SparkConf object that simply holds your application's options.
2. You create a SparkContext, which initializes your application. At this point the application connects to the master and asks for resources.
3. You modify the SparkContext object to include everything you want to make available to the mappers on other hosts, e.g. additional "*.py" files.
4. You create an RDD (e.g. with "sc.textFile") and run the actual commands ("map", "filter", etc.). The SparkContext knows about your additional files, so these commands are aware of your library code.

So, yes, with this setup you need to create "sc" (the SparkContext object) beforehand and make your "*.py" files available on the host where the application runs.
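
Putting those four steps together, a minimal driver script might look like this (a sketch only; the script name, zip path and input path are placeholders):

    # mymainscript.py -- run with: pyspark mymainscript.py
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("mylib-demo")        # 1. application options
    sc = SparkContext(conf=conf)                       # 2. connect to the master, ask for resources
    sc.addPyFile("/path/to/mylib.zip")                 # 3. ship your library to the workers

    rdd = sc.textFile("hdfs:///path/to/input.txt")     # 4. build an RDD and run commands
    print(rdd.filter(lambda line: "ERROR" in line).count())

    sc.stop()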

With the pyspark shell you already have an "sc" object initialized for you (try running "pyspark" and typing "sc" + Enter; the shell will print the Spark context's details). You can also use spark-submit [1], which will initialize the SparkContext from command-line options. But essentially the idea is always the same: there is a driver application running on one host that creates the SparkContext, collects dependencies, controls program flow, etc., and there are workers, i.e. applications on the slave hosts, that use the created SparkContext and the serialized data to carry out the driver's commands. The driver should know about everything and let the workers know whatever they need to know (e.g. your library code).


[1]: http://spark.apache.org/docs/latest/submitting-applications.html





