pyspark problems on yarn (job not parallelized, and Py4JJavaError)

pyspark problems on yarn (job not parallelized, and Py4JJavaError)

Xu (Simon) Chen
Hi folks,

I have a weird problem when using pyspark with yarn. I started ipython as follows:

IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G

When I create a notebook, I can see workers being created, and indeed I see the Spark UI running on my client machine on port 4040.

I have the following simple script:
"""
import pyspark
data = sc.textFile("hdfs://test/tmp/data/*").cache()
oneday = data.map(lambda line: line.split(",")).\
              map(lambda f: (f[0], float(f[1]))).\
              filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").\
              map(lambda t: (parser.parse(t[0]), t[1]))
oneday.take(1)
"""

By executing this, I see that my client machine (where ipython is launched) is reading all the data from HDFS and producing the result of take(1), rather than my worker nodes...

When I do "data.count()", things blow up altogether. But I do see something like this in the error message:
"""
Error from python worker:
  /usr/bin/python: No module named pyspark
"""

Am I supposed to install pyspark on every worker node?

Thanks.
-Simon

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

Andrew Or-2
Hi Simon,

You shouldn't have to install pyspark on every worker node. In YARN mode, pyspark is packaged into your assembly jar and shipped to your executors automatically. This seems like a more general problem. There are a few things to try:

1) Run a simple pyspark shell with yarn-client, and do "sc.parallelize(range(10)).count()" to see if you get the same error
2) If so, check if your assembly jar is compiled correctly. Run

$ jar -tf <path/to/assembly/jar> pyspark
$ jar -tf <path/to/assembly/jar> py4j

to see if the files are there. For Py4J, you need both the Python files and the Java class files.

3) If the files are there, try running a simple python shell (not pyspark shell) with the assembly jar on the PYTHONPATH:

$ PYTHONPATH=/path/to/assembly/jar python
>>> import pyspark

4) If that works, try it on every worker node. If it doesn't work, there is probably something wrong with your jar.

There is a known issue for PySpark on YARN - jars built with Java 7 cannot be properly opened by Java 6. I would either verify that the JAVA_HOME set on all of your workers points to Java 7 (by setting SPARK_YARN_USER_ENV), or simply build your jar with Java 6:

$ cd /path/to/spark/home
$ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop 2.3.0-cdh5.0.0

5) You can check out http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application, which has more detailed information about how to debug running an application on YARN in general. In my experience, the steps outlined there are quite useful.
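The jar checks in step 2 can also be scripted with Python's standard zipfile module (which, unlike Java 6's zip reader, handles Zip64 archives); this is a sketch, and the jar path in the comment is a placeholder:

```python
import zipfile

def jar_has_prefix(jar_path, prefix):
    """Return True if any entry in the jar (a zip archive) starts with `prefix`."""
    with zipfile.ZipFile(jar_path) as zf:
        return any(name.startswith(prefix) for name in zf.namelist())

# e.g. jar_has_prefix("/path/to/assembly.jar", "pyspark/")
#      jar_has_prefix("/path/to/assembly.jar", "py4j/")
```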

Let me know if you get it working (or not).

Cheers,
Andrew




Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

Xu (Simon) Chen
1) Yes, sc.parallelize(range(10)).count() gives the same error.

2) The files seem to be correct.

3) I have trouble at this step ("ImportError: No module named pyspark"), even though the files appear to be in the jar:
"""
$ PYTHONPATH=~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar python
>>> import pyspark
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named pyspark

$ jar -tf ~/spark-assembly-1.0.0-hadoop2.3.0-cdh5.0.1.jar pyspark
pyspark/
pyspark/rddsampler.py
pyspark/broadcast.py
pyspark/serializers.py
pyspark/java_gateway.py
pyspark/resultiterable.py
pyspark/accumulators.py
pyspark/sql.py
pyspark/__init__.py
pyspark/daemon.py
pyspark/context.py
pyspark/cloudpickle.py
pyspark/join.py
pyspark/tests.py
pyspark/files.py
pyspark/conf.py
pyspark/rdd.py
pyspark/storagelevel.py
pyspark/statcounter.py
pyspark/shell.py
pyspark/worker.py
"""

4) All my nodes should be running Java 7, so this is probably not related.
5) I'll do it in a bit.

Any ideas on 3)?

Thanks.
-Simon





Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

Xu (Simon) Chen
So, I did specify SPARK_JAR in my pyspark program. I also checked the workers; the jar file is distributed and included in the classpath correctly.

I think the problem is likely at step 3..

I built my jar file with maven, like this:
"mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0-cdh5.0.1 -DskipTests clean package"

Anything that I might have missed?

Thanks.
-Simon





Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

Xu (Simon) Chen
I asked several people; no one seems to believe that we can do this:
$ PYTHONPATH=/path/to/assembly/jar python
>>> import pyspark

The following pull request did mention something about generating a zip file for all python-related modules:
https://www.mail-archive.com/reviews@.../msg08223.html

I've tested that zipped modules can at least be imported via zipimport.
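That mechanism can be exercised with a small self-contained sketch (the module name and contents are made up): build a zip, put it on sys.path, and import from it, which is the same zipimport path that PYTHONPATH=/path/to/jar relies on.

```python
import os
import sys
import tempfile
import zipfile

# Build a tiny zip containing a throwaway module (name and contents invented),
# then import it off sys.path -- the same zipimport machinery used when a
# jar is placed on PYTHONPATH.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "demo.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("demomod.py", "VALUE = 42\n")

sys.path.insert(0, zip_path)
import demomod

print(demomod.VALUE)  # prints 42
```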

Any ideas?

-Simon





Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

Xu (Simon) Chen
OK, my colleague found this:
https://mail.python.org/pipermail/python-list/2014-May/671353.html

And my jar file has 70011 files. Fantastic..
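That number matters because a classic zip end-of-central-directory record can only describe 65,535 entries; past that, the archive must use the Zip64 extension, which Java 6's zip reader cannot open. A quick sketch of the check (the jar path in the comment is a placeholder):

```python
import zipfile

# Classic (non-Zip64) zip archives top out at 65,535 entries.
CLASSIC_ZIP_ENTRY_LIMIT = 65535

def needs_zip64(archive_path):
    """Return True if the archive has more entries than the classic zip format allows."""
    with zipfile.ZipFile(archive_path) as zf:
        return len(zf.namelist()) > CLASSIC_ZIP_ENTRY_LIMIT

# e.g. needs_zip64("/path/to/spark-assembly.jar")  # placeholder path
```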






Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

Patrick Wendell
Are you building Spark with Java 6 or Java 7? Java 6 uses the extended
Zip format and Java 7 uses Zip64. I think we've tried to add some
build warnings if Java 7 is used, for this reason:

https://github.com/apache/spark/blob/master/make-distribution.sh#L102

Any luck if you use JDK 6 to compile?



Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

Xu (Simon) Chen
Nope... I didn't try Java 6. The standard installation guide (http://spark.apache.org/docs/latest/building-with-maven.html) didn't say anything about Java 7 and suggested "-DskipTests" for the build..

So, I didn't see the warning message...




Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

Patrick Wendell
Yeah, we need to add a build warning to the Maven build. Would you be
able to try compiling Spark with Java 6? It would be good to narrow
down whether you are hitting this problem or something else.

On Mon, Jun 2, 2014 at 1:15 PM, Xu (Simon) Chen <[hidden email]> wrote:

> Nope... didn't try java 6. The standard installation guide didn't say
> anything about java 7 and suggested to do "-DskipTests" for the build..
> http://spark.apache.org/docs/latest/building-with-maven.html
>
> So, I didn't see the warning message...
>
>
> On Mon, Jun 2, 2014 at 3:48 PM, Patrick Wendell <[hidden email]> wrote:
>>
>> Are you building Spark with Java 6 or Java 7. Java 6 uses the extended
>> Zip format and Java 7 uses Zip64. I think we've tried to add some
>> build warnings if Java 7 is used, for this reason:
>>
>> https://github.com/apache/spark/blob/master/make-distribution.sh#L102
>>
>> Any luck if you use JDK 6 to compile?
>>
>>
>> On Mon, Jun 2, 2014 at 12:03 PM, Xu (Simon) Chen <[hidden email]>
>> wrote:
>> > OK, my colleague found this:
>> > https://mail.python.org/pipermail/python-list/2014-May/671353.html
>> >
>> > And my jar file has 70011 files. Fantastic..
>> >
>> >
>> >
>> >
>> > On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen <[hidden email]>
>> > wrote:
>> >>
>> >> I asked several people, no one seems to believe that we can do this:
>> >> $ PYTHONPATH=/path/to/assembly/jar python
>> >> >>> import pyspark
>> >>
>> >> This following pull request did mention something about generating a
>> >> zip
>> >> file for all python related modules:
>> >> https://www.mail-archive.com/reviews@.../msg08223.html
>> >>
>> >> I've tested that zipped modules can as least be imported via zipimport.
>> >>
>> >> Any ideas?
>> >>
>> >> -Simon
>> >>
>> >>
>> >>
>> >> On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or <[hidden email]>
>> >> wrote:
>> >>>
>> >>> Hi Simon,
>> >>>
>> >>> You shouldn't have to install pyspark on every worker node. In YARN
>> >>> mode,
>> >>> pyspark is packaged into your assembly jar and shipped to your
>> >>> executors
>> >>> automatically. This seems like a more general problem. There are a few
>> >>> things to try:
>> >>>
>> >>> 1) Run a simple pyspark shell with yarn-client, and do
>> >>> "sc.parallelize(range(10)).count()" to see if you get the same error
>> >>> 2) If so, check if your assembly jar is compiled correctly. Run
>> >>>
>> >>> $ jar -tf <path/to/assembly/jar> pyspark
>> >>> $ jar -tf <path/to/assembly/jar> py4j
>> >>>
>> >>> to see if the files are there. For Py4j, you need both the python
>> >>> files
>> >>> and the Java class files.
>> >>>
>> >>> 3) If the files are there, try running a simple python shell (not
>> >>> pyspark
>> >>> shell) with the assembly jar on the PYTHONPATH:
>> >>>
>> >>> $ PYTHONPATH=/path/to/assembly/jar python
>> >>> >>> import pyspark
>> >>>
>> >>> 4) If that works, try it on every worker node. If it doesn't work,
>> >>> there
>> >>> is probably something wrong with your jar.
>> >>>
>> >>> There is a known issue for PySpark on YARN - jars built with Java 7
>> >>> cannot be properly opened by Java 6. I would either verify that the
>> >>> JAVA_HOME set on all of your workers points to Java 7 (by setting
>> >>> SPARK_YARN_USER_ENV), or simply build your jar with Java 6:
>> >>>
>> >>> $ cd /path/to/spark/home
>> >>> $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop
>> >>> 2.3.0-cdh5.0.0
>> >>>
>> >>> 5) You can check out
>> >>>
>> >>> http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
>> >>> which has more detailed information about how to debug running an
>> >>> application on YARN in general. In my experience, the steps outlined
>> >>> there
>> >>> are quite useful.
>> >>>
>> >>> Let me know if you get it working (or not).
>> >>>
>> >>> Cheers,
>> >>> Andrew
>> >>>
>> >>>
>> >>>
>> >>> 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <[hidden email]>:
>> >>>
>> >>>> Hi folks,
>> >>>>
>> >>>> I have a weird problem when using pyspark with yarn. I started
>> >>>> ipython
>> >>>> as follows:
>> >>>>
>> >>>> IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4
>> >>>> --num-executors 4 --executor-memory 4G
>> >>>>
>> >>>> When I create a notebook, I can see workers being created and indeed
>> >>>> I
>> >>>> see spark UI running on my client machine on port 4040.
>> >>>>
>> >>>> I have the following simple script:
>> >>>> """
>> >>>> import pyspark
>> >>>> data = sc.textFile("hdfs://test/tmp/data/*").cache()
>> >>>> oneday = data.map(lambda line: line.split(",")).\
>> >>>>               map(lambda f: (f[0], float(f[1]))).\
>> >>>>               filter(lambda t: t[0] >= "2013-01-01" and t[0] <
>> >>>> "2013-01-02").\
>> >>>>               map(lambda t: (parser.parse(t[0]), t[1]))
>> >>>> oneday.take(1)
>> >>>> """
>> >>>>
>> >>>> When I execute this, I see that my client machine (where ipython is
>> >>>> launched) reads all the data from HDFS and produces the result of
>> >>>> take(1), rather than the worker nodes...
>> >>>>
>> >>>> When I do "data.count()", things would blow up altogether. But I do
>> >>>> see
>> >>>> in the error message something like this:
>> >>>> """
>> >>>>
>> >>>> Error from python worker:
>> >>>>   /usr/bin/python: No module named pyspark
>> >>>>
>> >>>> """
>> >>>>
>> >>>>
>> >>>> Am I supposed to install pyspark on every worker node?
>> >>>>
>> >>>>
>> >>>> Thanks.
>> >>>>
>> >>>> -Simon
>> >>>
>> >>>
>> >>
>> >
>
>
Reply | Threaded
Open this post in threaded view
|

Re: pyspark problems on yarn (job not parallelized, and Py4JJavaError)

Andrew Or-2
>> I asked several people; no one seems to believe that we can do this:
>> $ PYTHONPATH=/path/to/assembly/jar python
>> >>> import pyspark

That is because people usually don't package python files into their jars. For pyspark, however, this will work as long as the jar can be opened and its contents can be read. In my experience, if I am able to import the pyspark module by explicitly specifying the PYTHONPATH this way, then I can run pyspark on YARN without fail.
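For anyone skeptical, the mechanism is easy to reproduce without Spark at all: Python's zipimport treats any zip archive on sys.path (and a jar is just a zip with extra metadata) much like a directory. A minimal sketch, using a made-up module name:

```python
import os
import sys
import tempfile
import zipfile

# Build a throwaway "jar" (really just a zip) containing a Python package,
# analogous to the pyspark/ directory inside the Spark assembly jar.
# The module name `mymod` is invented for this demo.
tmpdir = tempfile.mkdtemp()
archive = os.path.join(tmpdir, "fake_assembly.jar")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("mymod/__init__.py", "VALUE = 42\n")

# Putting the archive on sys.path has the same effect as PYTHONPATH=<jar>.
sys.path.insert(0, archive)
import mymod

print(mymod.VALUE)  # prints 42, imported straight out of the archive
```

If this import works against the real assembly jar, the jar itself can be opened and read, which is exactly what the executors need.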

>> > OK, my colleague found this:
>> > https://mail.python.org/pipermail/python-list/2014-May/671353.html
>> >
>> > And my jar file has 70011 files. Fantastic..

It seems that this problem is not specific to reading a Java 7 jar with Java 6: an archive with 70011 entries is past the classic zip limit of 65535 files, so the jar needs Zip64 no matter how it was built. We definitely need to document and warn against Java 7 jars more aggressively. For now, please do try building the jar with Java 6.



2014-06-03 4:42 GMT+02:00 Patrick Wendell <[hidden email]>:
Yeah, we need to add a build warning to the Maven build. Would you be
able to try compiling Spark with Java 6? It would be good to narrow
down whether you are hitting this problem or something else.
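One cheap way to narrow it down from the Python side is to check whether the jar's entry count crosses the Zip64 threshold. A sketch (demonstrated on a tiny in-memory archive; point it at the real assembly jar instead):

```python
import io
import zipfile

def needs_zip64(jar_file):
    """True if the archive has more entries than a classic zip can record.

    Such a jar must use Zip64, which older readers (e.g. Java 6's zip
    code, and older zipimport implementations) cannot open.
    """
    with zipfile.ZipFile(jar_file) as zf:
        return len(zf.infolist()) > 65535

# Demo archive with a single entry; in practice pass the path to the
# Spark assembly jar, e.g. needs_zip64("/path/to/assembly/jar").
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("pyspark/__init__.py", "")

print(needs_zip64(buf))  # False for this one-entry demo archive
```

If this returns True for the assembly jar, the 70011-entry Zip64 issue is in play regardless of which JDK built it.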

On Mon, Jun 2, 2014 at 1:15 PM, Xu (Simon) Chen <[hidden email]> wrote:
> Nope... I didn't try Java 6. The standard installation guide didn't say
> anything about Java 7 and suggested "-DskipTests" for the build:
> http://spark.apache.org/docs/latest/building-with-maven.html
>
> So, I didn't see the warning message...
>
>
> On Mon, Jun 2, 2014 at 3:48 PM, Patrick Wendell <[hidden email]> wrote:
>>
>> Are you building Spark with Java 6 or Java 7? Java 6 uses the classic
>> Zip format, while Java 7 can emit Zip64. I think we've tried to add some
>> build warnings if Java 7 is used, for this reason:
>>
>> https://github.com/apache/spark/blob/master/make-distribution.sh#L102
>>
>> Any luck if you use JDK 6 to compile?