Can't find pyspark when using PySpark on YARN

Can't find pyspark when using PySpark on YARN

洪奇
Dear all,

When I submit a PySpark application with this command:

./bin/spark-submit --master yarn-client examples/src/main/python/wordcount.py "hdfs://..."

I get the following exception:

Error from python worker:
  Traceback (most recent call last):
    File "/usr/ali/lib/python2.5/runpy.py", line 85, in run_module
      loader = get_loader(mod_name)
    File "/usr/ali/lib/python2.5/pkgutil.py", line 456, in get_loader
      return find_loader(fullname)
    File "/usr/ali/lib/python2.5/pkgutil.py", line 466, in find_loader
      for importer in iter_importers(fullname):
    File "/usr/ali/lib/python2.5/pkgutil.py", line 422, in iter_importers
      __import__(pkg)
  ImportError: No module named pyspark
PYTHONPATH was:
  /home/xxx/spark/python:/home/xxx/spark_on_yarn/python/lib/py4j-0.8.1-src.zip:/disk11/mapred/tmp/usercache/xxxx/filecache/11/spark-assembly-1.0.0-hadoop2.0.0-ydh2.0.0.jar

Maybe the `pyspark` package (under `python/pyspark`) and `py4j-0.8.1-src.zip` are not included on the YARN workers.
How can I distribute these files with my application? Can I use `--py-files python.zip,py4j-0.8.1-src.zip`?
Or how can I package the pyspark modules into a .egg file?
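
For example, I imagine something like the following (the zip layout and file names are my guesses; I have not verified that YARN picks the files up this way):

  # build a zip of the pyspark package (run from $SPARK_HOME)
  (cd python && zip -r ../pyspark.zip pyspark)
  # ship both archives alongside the application
  ./bin/spark-submit --master yarn-client \
    --py-files pyspark.zip,python/lib/py4j-0.8.1-src.zip \
    examples/src/main/python/wordcount.py "hdfs://..."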




Re: Can't find pyspark when using PySpark on YARN

Andrew Or
Hi Qi Ping,

You don't have to distribute these files; they are automatically packaged in the assembly jar, which is already shipped to the worker nodes.
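
You can double-check this by listing the jar's contents, e.g. (adjust the jar path to your own build):

  # the pyspark and py4j modules should appear inside the assembly jar
  jar tf assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.0.0.jar | grep -E 'pyspark|py4j'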

Other people have run into the same issue. See if the instructions here are of any help: http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3cCAMJOb8mr1+ias-SLDz_RfRKe_nA2UUbNmHraC4NUKqYqNUNHuQ@...%3e

As described in the link, the last resort is to rebuild your assembly jar with JAVA_HOME pointing to Java 6. The jar tool in Java 7 can produce archives in ZIP64 format once they contain enough entries, and Python's zipimport cannot read ZIP64, so the pyspark modules inside the assembly become invisible to the Python workers; building with Java 6 avoids this (more details in the link above).
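
Concretely, the rebuild looks something like this (the JDK 6 path below is system-specific; use wherever your Java 6 lives):

  # point the build at a Java 6 JDK (path varies per system)
  export JAVA_HOME=/usr/lib/jvm/java-1.6.0
  # rebuild the assembly with either of Spark's build tools
  sbt/sbt clean assembly
  # or: mvn -DskipTests clean package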

Cheers,
Andrew

