pyspark - memory leak leading to OOM after submitting 100 jobs?

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

pyspark - memory leak leading to OOM after submitting 100 jobs?

Paul Wais
Dear List,

I've observed some sort of memory leak when using pyspark to run ~100
jobs in local mode.  Each job is essentially a create RDD -> create DF
-> write DF sort of flow.  The RDD and DFs go out of scope after each
job completes, hence I call this issue a "memory leak."  Here's
pseudocode:

```
row_rdds = []
for i in range(100):
  row_rdd = spark.sparkContext.parallelize([{'a': i} for i in range(1000)])
  row_rdds.append(row_rdd)

for row_rdd in row_rdds:
  df = spark.createDataFrame(row_rdd)
  df.persist()
  print(df.count())
  df.write.save(...) # Save parquet
  df.unpersist()

  # Does not help:
  # del df
  # del row_rdd
```

In my real application:
 * rows are much larger, perhaps 1MB each
 * row_rdds are sized to fit available RAM

I observe that after 100 or so iterations of the second loop (each of
which creates a "job" in the Spark WebUI), the following happens:
 * pyspark workers have fairly stable resident and virtual RAM usage
 * java process eventually approaches resident RAM cap (8GB standard)
but virtual RAM usage keeps ballooning.

Eventually the machine runs out of RAM and the linux OOM killer kills
the java process, resulting in an "IndexError: pop from an empty
deque" error from py4j/java_gateway.py .


Does anybody have any ideas about what's going on?  Note that this is
local mode.  I have personally run standalone masters and submitted a
ton of jobs and never seen something like this over time.  Those were
very different jobs, but perhaps this issue is bespoke to local mode?

Emphasis: I did try to del the pyspark objects and run python GC.
That didn't help at all.

pyspark 2.4.4 on java 1.8 on ubuntu bionic (tensorflow docker image)

12-core i7 with 16GB of ram and 22GB swap file (swap is *on*).

Cheers,
-Paul

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

Jungtaek Lim-2
Honestly I'd recommend you to spend you time to look into the issue, via taking memory dump per some interval and compare differences (at least share these dump files to community with redacting if necessary). Otherwise someone has to try to reproduce without reproducer and even couldn't reproduce even they spent their time. Memory leak issue is not really easy to reproduce, unless it leaks some objects without any conditions.

- Jungtaek Lim (HeartSaVioR)

On Sun, Oct 20, 2019 at 7:18 PM Paul Wais <[hidden email]> wrote:
Dear List,

I've observed some sort of memory leak when using pyspark to run ~100
jobs in local mode.  Each job is essentially a create RDD -> create DF
-> write DF sort of flow.  The RDD and DFs go out of scope after each
job completes, hence I call this issue a "memory leak."  Here's
pseudocode:

```
row_rdds = []
for i in range(100):
  row_rdd = spark.sparkContext.parallelize([{'a': i} for i in range(1000)])
  row_rdds.append(row_rdd)

for row_rdd in row_rdds:
  df = spark.createDataFrame(row_rdd)
  df.persist()
  print(df.count())
  df.write.save(...) # Save parquet
  df.unpersist()

  # Does not help:
  # del df
  # del row_rdd
```

In my real application:
 * rows are much larger, perhaps 1MB each
 * row_rdds are sized to fit available RAM

I observe that after 100 or so iterations of the second loop (each of
which creates a "job" in the Spark WebUI), the following happens:
 * pyspark workers have fairly stable resident and virtual RAM usage
 * java process eventually approaches resident RAM cap (8GB standard)
but virtual RAM usage keeps ballooning.

Eventually the machine runs out of RAM and the linux OOM killer kills
the java process, resulting in an "IndexError: pop from an empty
deque" error from py4j/java_gateway.py .


Does anybody have any ideas about what's going on?  Note that this is
local mode.  I have personally run standalone masters and submitted a
ton of jobs and never seen something like this over time.  Those were
very different jobs, but perhaps this issue is bespoke to local mode?

Emphasis: I did try to del the pyspark objects and run python GC.
That didn't help at all.

pyspark 2.4.4 on java 1.8 on ubuntu bionic (tensorflow docker image)

12-core i7 with 16GB of ram and 22GB swap file (swap is *on*).

Cheers,
-Paul

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

Paul Wais
Well, dumb question:

Given the workflow outlined above, should Local Mode keep running?  Or
is the leak a known issue?  I just wanted to check because I can't
recall seeing this issue with a non-local master, though it's possible
there were task failures that hid the issue.

If this issue looks new, what's the easiest way to record memory dumps
or do profiling?  Can I put something in my spark-defaults.conf ?

The code is open source and the run is reproducible, although the
specific test currently requires a rather large (but public) dataset.

On Sun, Oct 20, 2019 at 6:24 PM Jungtaek Lim
<[hidden email]> wrote:

>
> Honestly I'd recommend you to spend you time to look into the issue, via taking memory dump per some interval and compare differences (at least share these dump files to community with redacting if necessary). Otherwise someone has to try to reproduce without reproducer and even couldn't reproduce even they spent their time. Memory leak issue is not really easy to reproduce, unless it leaks some objects without any conditions.
>
> - Jungtaek Lim (HeartSaVioR)
>
> On Sun, Oct 20, 2019 at 7:18 PM Paul Wais <[hidden email]> wrote:
>>
>> Dear List,
>>
>> I've observed some sort of memory leak when using pyspark to run ~100
>> jobs in local mode.  Each job is essentially a create RDD -> create DF
>> -> write DF sort of flow.  The RDD and DFs go out of scope after each
>> job completes, hence I call this issue a "memory leak."  Here's
>> pseudocode:
>>
>> ```
>> row_rdds = []
>> for i in range(100):
>>   row_rdd = spark.sparkContext.parallelize([{'a': i} for i in range(1000)])
>>   row_rdds.append(row_rdd)
>>
>> for row_rdd in row_rdds:
>>   df = spark.createDataFrame(row_rdd)
>>   df.persist()
>>   print(df.count())
>>   df.write.save(...) # Save parquet
>>   df.unpersist()
>>
>>   # Does not help:
>>   # del df
>>   # del row_rdd
>> ```
>>
>> In my real application:
>>  * rows are much larger, perhaps 1MB each
>>  * row_rdds are sized to fit available RAM
>>
>> I observe that after 100 or so iterations of the second loop (each of
>> which creates a "job" in the Spark WebUI), the following happens:
>>  * pyspark workers have fairly stable resident and virtual RAM usage
>>  * java process eventually approaches resident RAM cap (8GB standard)
>> but virtual RAM usage keeps ballooning.
>>
>> Eventually the machine runs out of RAM and the linux OOM killer kills
>> the java process, resulting in an "IndexError: pop from an empty
>> deque" error from py4j/java_gateway.py .
>>
>>
>> Does anybody have any ideas about what's going on?  Note that this is
>> local mode.  I have personally run standalone masters and submitted a
>> ton of jobs and never seen something like this over time.  Those were
>> very different jobs, but perhaps this issue is bespoke to local mode?
>>
>> Emphasis: I did try to del the pyspark objects and run python GC.
>> That didn't help at all.
>>
>> pyspark 2.4.4 on java 1.8 on ubuntu bionic (tensorflow docker image)
>>
>> 12-core i7 with 16GB of ram and 22GB swap file (swap is *on*).
>>
>> Cheers,
>> -Paul
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: [hidden email]
>>

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

Nicolas Paris-2
In reply to this post by Paul Wais
have you deactivated the spark.ui ?
I have read several thread explaining the ui can lead to OOM because it
stores 1000 dags by default


On Sun, Oct 20, 2019 at 03:18:20AM -0700, Paul Wais wrote:

> Dear List,
>
> I've observed some sort of memory leak when using pyspark to run ~100
> jobs in local mode.  Each job is essentially a create RDD -> create DF
> -> write DF sort of flow.  The RDD and DFs go out of scope after each
> job completes, hence I call this issue a "memory leak."  Here's
> pseudocode:
>
> ```
> row_rdds = []
> for i in range(100):
>   row_rdd = spark.sparkContext.parallelize([{'a': i} for i in range(1000)])
>   row_rdds.append(row_rdd)
>
> for row_rdd in row_rdds:
>   df = spark.createDataFrame(row_rdd)
>   df.persist()
>   print(df.count())
>   df.write.save(...) # Save parquet
>   df.unpersist()
>
>   # Does not help:
>   # del df
>   # del row_rdd
> ```
>
> In my real application:
>  * rows are much larger, perhaps 1MB each
>  * row_rdds are sized to fit available RAM
>
> I observe that after 100 or so iterations of the second loop (each of
> which creates a "job" in the Spark WebUI), the following happens:
>  * pyspark workers have fairly stable resident and virtual RAM usage
>  * java process eventually approaches resident RAM cap (8GB standard)
> but virtual RAM usage keeps ballooning.
>
> Eventually the machine runs out of RAM and the linux OOM killer kills
> the java process, resulting in an "IndexError: pop from an empty
> deque" error from py4j/java_gateway.py .
>
>
> Does anybody have any ideas about what's going on?  Note that this is
> local mode.  I have personally run standalone masters and submitted a
> ton of jobs and never seen something like this over time.  Those were
> very different jobs, but perhaps this issue is bespoke to local mode?
>
> Emphasis: I did try to del the pyspark objects and run python GC.
> That didn't help at all.
>
> pyspark 2.4.4 on java 1.8 on ubuntu bionic (tensorflow docker image)
>
> 12-core i7 with 16GB of ram and 22GB swap file (swap is *on*).
>
> Cheers,
> -Paul
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: [hidden email]
>

--
nicolas

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

Holden Karau


On Thu, Oct 31, 2019 at 10:04 PM Nicolas Paris <[hidden email]> wrote:
have you deactivated the spark.ui ?
I have read several thread explaining the ui can lead to OOM because it
stores 1000 dags by default


On Sun, Oct 20, 2019 at 03:18:20AM -0700, Paul Wais wrote:
> Dear List,
>
> I've observed some sort of memory leak when using pyspark to run ~100
> jobs in local mode.  Each job is essentially a create RDD -> create DF
> -> write DF sort of flow.  The RDD and DFs go out of scope after each
> job completes, hence I call this issue a "memory leak."  Here's
> pseudocode:
>
> ```
> row_rdds = []
> for i in range(100):
>   row_rdd = spark.sparkContext.parallelize([{'a': i} for i in range(1000)])
>   row_rdds.append(row_rdd)
>
> for row_rdd in row_rdds:
>   df = spark.createDataFrame(row_rdd)
>   df.persist()
>   print(df.count())
>   df.write.save(...) # Save parquet
>   df.unpersist()
>
>   # Does not help:
>   # del df
>   # del row_rdd
> ```
The connection between Python GC/del and JVM GC is perhaps a bit weaker than we might like. There certainly could be a problem here, but it still shouldn’t be getting to the OOM state.

>
> In my real application:
>  * rows are much larger, perhaps 1MB each
>  * row_rdds are sized to fit available RAM
>
> I observe that after 100 or so iterations of the second loop (each of
> which creates a "job" in the Spark WebUI), the following happens:
>  * pyspark workers have fairly stable resident and virtual RAM usage
>  * java process eventually approaches resident RAM cap (8GB standard)
> but virtual RAM usage keeps ballooning.
>
Can you share what flags the JVM is launching with? Also which JVM(s) are ballooning?

> Eventually the machine runs out of RAM and the linux OOM killer kills
> the java process, resulting in an "IndexError: pop from an empty
> deque" error from py4j/java_gateway.py .
>
>
> Does anybody have any ideas about what's going on?  Note that this is
> local mode.  I have personally run standalone masters and submitted a
> ton of jobs and never seen something like this over time.  Those were
> very different jobs, but perhaps this issue is bespoke to local mode?
>
> Emphasis: I did try to del the pyspark objects and run python GC.
> That didn't help at all.
>
> pyspark 2.4.4 on java 1.8 on ubuntu bionic (tensorflow docker image)
>
> 12-core i7 with 16GB of ram and 22GB swap file (swap is *on*).
>
> Cheers,
> -Paul
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: [hidden email]
>

--
nicolas

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

--
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9