trouble with broadcast variables on pyspark

trouble with broadcast variables on pyspark

Sandy Ryza
I'm running into an issue when trying to broadcast large variables with pyspark.

A ~1GB array seems to be blowing up beyond the size of the driver machine's memory when it's pickled.

I've tried to get around this by broadcasting smaller chunks of it one at a time.  But I'm still running out of memory, ostensibly because the intermediate pickled versions aren't getting garbage collected.
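To make that concrete, a rough sketch of the chunked approach (the array contents, sizes, and names here are just placeholders, not the real data):

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext("local", "broadcast-chunks")

    # Stand-in for the real ~1GB array
    big_array = np.arange(125000000, dtype=np.float64)

    # Broadcast the array in smaller slices, one at a time,
    # instead of pickling the whole thing in a single call
    chunk_size = 10000000
    broadcasts = []
    for start in range(0, len(big_array), chunk_size):
        broadcasts.append(sc.broadcast(big_array[start:start + chunk_size]))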

Any ideas on how to get around this?  Is this some sort of py4j limitation?  Is there any reason that the Spark driver would be keeping the pickled version around? 

thanks in advance for any help,
Sandy

Re: trouble with broadcast variables on pyspark

Josh Rosen
I've opened an issue for this on JIRA: https://spark-project.atlassian.net/browse/SPARK-1065

To clarify, is the driver JVM running out of memory with an OutOfMemoryError?  Or is the Python process exceeding some memory limit?
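One quick way to narrow it down (sketch only, with a stand-in array) is to measure how large the pickled form gets on the Python side before anything is handed to the JVM:

    import pickle
    import numpy as np

    # Small stand-in; scale up toward the real ~1GB array as memory allows
    arr = np.zeros(10000000)
    # Roughly what broadcasting has to do to the value on the Python side
    data = pickle.dumps(arr, protocol=2)
    print("in-memory bytes: %d" % arr.nbytes)
    print("pickled bytes:   %d" % len(data))
    del data

If the pickled size is close to the in-memory size, the blow-up is more likely coming from extra copies being held (in Python or in the driver JVM) than from pickling overhead itself.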



Re: trouble with broadcast variables on pyspark

Sandy Ryza
Thanks Josh.  The driver JVM is hitting OutOfMemoryErrors, but the Python process is taking even more memory.  I added more details to the JIRA.
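One variation that might be worth trying (sketch only; the chunk size and names are placeholders): explicitly drop the Python-side reference and force a collection after each broadcast, to rule out the Python process holding the intermediate pickled copies:

    import gc

    # big_array and sc as in the chunked sketch above
    chunk_size = 10000000
    broadcasts = []
    for start in range(0, len(big_array), chunk_size):
        chunk = big_array[start:start + chunk_size]
        broadcasts.append(sc.broadcast(chunk))
        del chunk     # drop the Python-side reference to the slice
        gc.collect()  # reclaim any intermediate pickled bytes still held in Python

If memory keeps growing even with this, that would point at the pickled copies being retained on the JVM side rather than in the Python process.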



Re: trouble with broadcast variables on pyspark

aazout
Has this issue been resolved?
CEO / Velos (velos.io)