For third-party libraries the simplest way is to use Puppet, Chef, or a similar automation tool to install packages (either from pip or from your distribution's repository). This is easy because, if you manage your cluster's software, you are most probably already using one of these automation tools, so you only need to write one more recipe to keep your Python packages healthy.
For your own libraries it may be inconvenient to publish them to PyPI just to deploy them to your server. So you can also "attach" your Python lib to the SparkContext via the "pyFiles" option or the "addPyFile" method. For example, if you want to attach a single Python file, you can do the following:
conf = SparkConf(...)
sc = SparkContext(...)
...
sc.addPyFile('/path/to/yourmodule.py')
And for entire packages (in Python, a package is any directory with an "__init__.py" and possibly several more "*.py" files) you can use a trick and pack them into a zip archive. I used the following code for my library:
import os
import uuid
import zipfile

def ziplib():
    libpath = os.path.dirname(__file__)                      # this should point to your packages directory
    zippath = '/tmp/mylib-' + uuid.uuid4().hex[:6] + '.zip'  # some random filename in a writable directory
    zf = zipfile.PyZipFile(zippath, mode='w')
    try:
        zf.debug = 3          # making it verbose, good for debugging
        zf.writepy(libpath)
        return zippath        # return path to generated zip archive
    finally:
        zf.close()

zip_path = ziplib()       # generate zip archive containing your lib
sc.addPyFile(zip_path)    # add the entire archive to SparkContext
...
os.remove(zip_path)       # don't forget to remove the temporary file, preferably in a "finally" clause
Python has one nice feature: it can import code not only from modules (simple "*.py" files) and packages, but also from a variety of other formats, including zip archives. So when your distributed code says something like "import mylib", Python finds "mylib.zip" attached to the SparkContext and imports the required modules from it.
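You can see this mechanism working outside of Spark, too: adding a zip archive to sys.path (which is essentially what Spark does on the workers) makes the packages inside it importable. A minimal standalone sketch (the package name "mylib" and the "greet" function are just illustrative):

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny package "mylib" inside a zip archive.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, 'mylib.zip')
with zipfile.ZipFile(zip_path, mode='w') as zf:
    zf.writestr('mylib/__init__.py', '')
    zf.writestr('mylib/util.py', 'def greet():\n    return "hello from zip"\n')

# Putting the archive on sys.path is what addPyFile effectively
# arranges on every worker.
sys.path.insert(0, zip_path)

from mylib.util import greet
print(greet())   # -> hello from zip
```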
Thank you for your help! Just to make sure I understand, when I run this command sc.addPyFile("/path/to/yourmodule.py"), I need to be already logged into the master node and have my python files somewhere, is that correct?
In my answer I assumed you run your program with the "pyspark" command (e.g. "pyspark mymainscript.py"; pyspark should be on your PATH). In this case the workflow is as follows:
1. You create a SparkConf object that simply contains your app's options.
2. You create a SparkContext, which initializes your application. At this point the application connects to the master and asks for resources.
3. You modify the SparkContext object to include everything you want to make available to mappers on other hosts, e.g. other "*.py" files.
4. You create an RDD (e.g. with "sc.textFile") and run actual commands ("map", "filter", etc.). The SparkContext knows about your additional files, so these commands are aware of your library code.
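Put together, the four steps look roughly like this. This is a sketch, not a ready-to-run job: the app name, the archive path, and the word-count logic are placeholders, and you would run it against your own cluster:

```python
from pyspark import SparkConf, SparkContext

# 1. App options
conf = SparkConf().setAppName('myapp')           # placeholder app name

# 2. Initialize the application; this connects to the master
sc = SparkContext(conf=conf)

# 3. Ship dependencies to the workers
sc.addPyFile('/tmp/mylib.zip')                   # placeholder path to your packed library

# 4. Build an RDD and run actual commands; mappers can now "import mylib"
counts = (sc.textFile('/data/input.txt')         # placeholder input path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.take(5))
```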
So, yes, in this setup you need to create "sc" (the SparkContext object) beforehand and make your "*.py" files available on the application's host.
With the pyspark shell an "sc" object is already initialized for you (try running "pyspark" and typing "sc" + Enter; the shell will print the Spark context details). You can also use spark-submit, which will initialize the SparkContext from command-line options. But the idea is essentially always the same: there is a driver application running on one host that creates the SparkContext, collects dependencies, controls program flow, etc., and there are workers, applications on the slave hosts, that use the created SparkContext and all the serialized data to perform the driver's commands. The driver should know about everything and let the workers know about what they need to know (e.g. your library code).