Best way of shipping self-contained pyspark jobs with 3rd-party dependencies

Sergey Zhemzhitsky
Hi PySparkers,

What is currently the best way of shipping self-contained PySpark jobs with 3rd-party dependencies?
There are open JIRA issues [1], [2], corresponding PRs [3], [4], and articles [5], [6] on setting up the Python environment with conda and virtualenv, respectively.
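
For concreteness, the kind of per-job dependency bundle I have in mind can be built roughly like this (a sketch; "requirements.txt", "deps/" and "deps.zip" are placeholder names, and this only covers pure-Python packages):

import os
import subprocess
import zipfile

# install the job's pinned dependencies into a local directory,
# without touching the system or cluster-wide site-packages
subprocess.check_call(
    ["pip", "install", "--target", "deps", "-r", "requirements.txt"])

# zip the directory so it can be shipped alongside the job
with zipfile.ZipFile("deps.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for root, _, files in os.walk("deps"):
        for name in files:
            path = os.path.join(root, name)
            zf.write(path, os.path.relpath(path, "deps"))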

So I'm wondering what the community does when it's necessary to
- prevent Python package/module version conflicts between different jobs
- avoid updating every node of the cluster whenever a job gains new dependencies
- track which dependencies are introduced on a per-job basis
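
For reference, the plain zip-and-addPyFile baseline looks roughly like this (again a sketch; some_pinned_package stands for any module contained in the deps.zip built above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("self-contained-job").getOrCreate()
sc = spark.sparkContext

# ship the job's own bundle: it is added to the Python path of the driver
# and of every executor for this application only, so versions cannot clash
# with other jobs and nothing has to be preinstalled on the cluster nodes
sc.addPyFile("deps.zip")

def uses_dependency(x):
    # imported from deps.zip on the executors
    import some_pinned_package
    return some_pinned_package.__name__, x

print(sc.parallelize(range(3)).map(uses_dependency).collect())

This gives per-job version isolation for pure-Python packages, but not for packages with native extensions, which is where the conda/virtualenv work above becomes interesting.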