Python 2.7 + numpy break sortByKey()

Nick Chammas (Sun, Mar 2, 2014 at 1:50 AM)
Unexpected behavior. Here's the repro:
  1. Launch an EC2 cluster with spark-ec2. 1 slave; default instance type.
  2. Upgrade the cluster to Python 2.7 using the instructions here.
  3. pip2.7 install numpy
  4. Run this script in the pyspark shell:

    wikistat = sc.textFile('s3n://ACCESSKEY:SECRET@bigdatademo/sample/wiki/pagecounts-20100212-050000.gz')
    wikistat = wikistat.map(lambda x: x.split(' ')).cache()
    wikistat.map(lambda x: (x[1], int(x[3]))).map(lambda x: (x[1],x[0])).sortByKey(False).take(5)

  5. You will see a long error output that includes a complaint about NumPy not being installed.
  6. Now remove the sortByKey() from that last line and rerun it.

    wikistat.map(lambda x: (x[1], int(x[3]))).map(lambda x: (x[1],x[0])).take(5)

    You should see your results without issue. So it's the sortByKey() that's choking.
  7. Quit the pyspark shell and pip uninstall numpy.
  8. Rerun the three lines from step 4. Enjoy your sorted results error-free.
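
For readers without a cluster at hand, here is a minimal plain-Python sketch of what the three lines in step 4 compute. It does not use Spark and does not reproduce the error. The sample lines are made up, the function name `top_by_fourth_field` is hypothetical, and it assumes each record is space-separated with at least four fields, the fourth being an integer.

```python
# Made-up sample records; the real input is the pagecounts file from step 4.
sample_lines = [
    "en Main_Page 10 5000",
    "en Apache_Spark 3 1200",
    "fr Accueil 7 9000",
    "de Hauptseite 5 300",
]


def top_by_fourth_field(lines, n=5):
    """Mirror the pyspark pipeline from step 4 using plain lists."""
    rows = [line.split(' ') for line in lines]           # map(lambda x: x.split(' '))
    pairs = [(row[1], int(row[3])) for row in rows]      # map(lambda x: (x[1], int(x[3])))
    swapped = [(value, name) for name, value in pairs]   # map(lambda x: (x[1], x[0]))
    swapped.sort(key=lambda kv: kv[0], reverse=True)     # sortByKey(False): descending by key
    return swapped[:n]                                   # take(5)


result = top_by_fourth_field(sample_lines)
print(result)
# [(9000, 'Accueil'), (5000, 'Main_Page'), (1200, 'Apache_Spark'), (300, 'Hauptseite')]
```
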
Can anyone else reproduce this issue? Is it a bug? I don't see it if I leave the cluster on the default Python 2.6.8.

Installing numpy on the slave via pssh and pip2.7 (so that it matches the master) does not fix the issue. I don't know whether installing Python packages on every node is even necessary, though.
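
One way to check whether the worker processes really see the same interpreter and NumPy as the master is a small probe like the one below. This is a sketch, not something taken from the thread: `probe_python_env` is a hypothetical helper, and the commented-out Spark call assumes the `sc` object that the pyspark shell provides.

```python
import socket
import sys


def probe_python_env(_=None):
    """Report host, interpreter path, Python version and NumPy version
    (None if NumPy cannot be imported) for the calling process."""
    try:
        import numpy
        numpy_version = numpy.__version__
    except ImportError:
        numpy_version = None
    python_version = "%d.%d.%d" % sys.version_info[:3]
    return (socket.gethostname(), sys.executable, python_version, numpy_version)


# Local check: shows what the current process (e.g. the shell on the master) sees.
print(probe_python_env())

# Assumed usage inside the pyspark shell, where `sc` already exists:
# run the probe in many tasks and collect the distinct answers.
# sc.parallelize(range(16), 16).map(probe_python_env).distinct().collect()
```

If the tuples collected from the tasks differ from the local one (a different interpreter path or version, or `None` for NumPy), the worker is not running the environment that was set up on the master.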

Nick

Re: Python 2.7 + numpy break sortByKey()

Nick Chammas (Sun, Mar 2, 2014 at 4:32 PM)
So this issue appears to be related to the other Python 2.7-related issue I reported in this thread.

Shall I open a bug in JIRA about this and include the wikistat repro?

Nick


Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Python 2.7 + numpy break sortByKey()

Nick Chammas (Wed, Mar 5, 2014 at 9:02 AM)
Devs? Is this an issue for you that deserves a ticket?


Re: Python 2.7 + numpy break sortByKey()

Patrick Wendell
The difference between your two jobs is that take() is optimized and
only runs on the machine where you are using the shell, whereas
sortByKey() has to run on the worker machines. It seems that Python
may not have been upgraded correctly on one of the slaves. I would look in
the /root/spark/work/ folder (find the most recent application log) on
each slave and see which slave is logging the error message.
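
The log check described above could be scripted roughly as follows and run on each slave. This is only a sketch under assumptions: it assumes the executor logs are files named `stderr` and `stdout` somewhere under the work folder, that the error text mentions "numpy", and `find_log_hits` is a hypothetical helper name.

```python
import io
import os


def find_log_hits(work_dir, needle="numpy"):
    """Return (path, line number, line) for every executor log line under
    work_dir that contains needle, compared case-insensitively."""
    hits = []
    needle = needle.lower()
    for dirpath, _dirnames, filenames in os.walk(work_dir):
        for name in filenames:
            if name not in ("stderr", "stdout"):  # assumed log file names
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the scan going past undecodable bytes.
            with io.open(path, "r", errors="replace") as handle:
                for lineno, line in enumerate(handle, 1):
                    if needle in line.lower():
                        hits.append((path, lineno, line.rstrip()))
    return hits


# Prints nothing if the folder does not exist or no line matches.
for path, lineno, line in find_log_hits("/root/spark/work"):
    print("%s:%d: %s" % (path, lineno, line))
```
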
