pyspark broadcast error

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

pyspark broadcast error

bmiller1
Hi All,

When I run the program shown below, I receive the error shown below.
I am running the current version of branch-0.9 from github.  Note that
I do not receive the error when I replace "2 ** 29" with "2 ** X",
where X < 29.  More interestingly, I do not receive the error when X =
30, and when X > 30 the code either crashes with "Memory Error" or
"Py4JNetworkError: An error occurred while trying to connect to the
Java server".

I am aware that there are some bugs
(https://spark-project.atlassian.net/browse/SPARK-1065) related to
memory consumption with pyspark and broadcasting, but the behavior
with X = 29 seemed different and I was wondering if anybody had any
insight.

-Brad

*Program*
from pyspark import SparkContext
SparkContext.setSystemProperty('spark.executor.memory', '25g')
sc = SparkContext('spark://crosby.research.intel-research.net:7077',
'FeatureExtraction')
meg_512 = range((2 ** 29) / 8)
tmp_broad = sc.broadcast(meg_512)

*Error*
---------------------------------------------------------------------------
Py4JError                                 Traceback (most recent call last)
<ipython-input-1-db8033dee301> in <module>()
      3 sc = SparkContext('spark://crosby.research.intel-research.net:7077',
'FeatureExtraction')
      4 meg_1024 = range((2 ** 29) / 8)
----> 5 tmp_broad = sc.broadcast(meg_1024)

/home/spark/spark-branch-0.9/python/pyspark/context.py in broadcast(self, value)
    277         pickleSer = PickleSerializer()
    278         pickled = pickleSer.dumps(value)
--> 279         jbroadcast = self._jsc.broadcast(bytearray(pickled))
    280         return Broadcast(jbroadcast.id(), value, jbroadcast,
    281                          self._pickled_broadcast_vars)

/home/spark/spark-branch-0.9/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py
in __call__(self, *args)
    535         answer = self.gateway_client.send_command(command)
    536         return_value = get_return_value(answer, self.gateway_client,
--> 537                 self.target_id, self.name)
    538
    539         for temp_arg in temp_args:

/home/spark/spark-branch-0.9/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py
in get_return_value(answer, gateway_client, target_id, name)
    302                 raise Py4JError(
    303                     'An error occurred while calling
{0}{1}{2}. Trace:\n{3}\n'.
--> 304                     format(target_id, '.', name, value))
    305         else:
    306             raise Py4JError(

Py4JError: An error occurred while calling o7.broadcast. Trace:
java.lang.NegativeArraySizeException
at py4j.Base64.decode(Base64.java:292)
at py4j.Protocol.getBytes(Protocol.java:167)
at py4j.Protocol.getObject(Protocol.java:276)
at py4j.commands.AbstractCommand.getArguments(AbstractCommand.java:81)
at py4j.commands.CallCommand.execute(CallCommand.java:77)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:701)