Troubleshooting Spark OOM


Troubleshooting Spark OOM

William Shen
Hi there,

We've encountered Spark executor Java OOM issues in our Spark application. Any tips on how to troubleshoot and identify what objects are occupying the heap? In the past, when dealing with JVM OOM, we've analyzed heap dumps, but we are having a hard time locating the Spark heap dump after a crash, and we also anticipate that these heap dumps will be huge (since our nodes have a large memory allocation) and may be difficult to analyze locally. Can someone share their experience dealing with Spark OOM?

Thanks!

Re: Troubleshooting Spark OOM

Dillon Dukek
Hi William,

Just to get started, can you describe the Spark version you are using and the language? It doesn't sound like you are using PySpark; however, the problems arising from that can be different, so I just want to be sure. Also, can you talk through the scenario under which you hit this error, i.e., the order of the transformations you are applying?

However, if you're set on getting a heap dump, probably the easiest way is to monitor an active application through the Spark UI, then grab a heap dump from the executor Java process when you notice one having problems.
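A rough sketch of that workflow, assuming a YARN or standalone executor (the dump paths below are illustrative, and the trailing `...` stands in for your actual submit arguments):

```shell
# On the worker node: find the executor JVM. For YARN/standalone
# executors the main class is CoarseGrainedExecutorBackend.
jps -l | grep CoarseGrainedExecutorBackend

# Dump the live heap for the PID found above.
jmap -dump:live,format=b,file=/tmp/executor-heap.hprof <pid>

# To capture a dump automatically when an executor OOMs (for the
# crash case), submit with HeapDumpOnOutOfMemoryError enabled:
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dumps" \
  ...
```

For very large dumps, a tool like Eclipse MAT on a machine with plenty of RAM tends to work better than analyzing them locally.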


Re: Troubleshooting Spark OOM

Ramandeep Singh
Hi,

Here are a few OOM issues I have faced with Spark, and suggestions you can try:

- Not enough shuffle partitions: increase them.
- Memory overhead set too low: boosting it to around 12 percent of executor memory usually helps. You typically see this as an error message in your executors.
- Large executor configs: these can be problematic; more, smaller executors are preferred over fewer, larger ones.
- Changing the GC algorithm.
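On Spark 1.6 with YARN, those suggestions map to submit flags roughly like this (the values are illustrative starting points, not tuned recommendations, and the trailing `...` stands for your application jar and arguments):

```shell
# spark.sql.shuffle.partitions: more shuffle partitions (default is 200).
# spark.yarn.executor.memoryOverhead: off-heap overhead in MB; 2048 is
#   roughly 12% of a 16g executor (the 1.6 default is ~10%, min 384).
# More, smaller executors rather than a few huge ones.
# extraJavaOptions: try G1 instead of the default collector.
spark-submit \
  --conf spark.sql.shuffle.partitions=800 \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --num-executors 20 \
  --executor-memory 16g \
  --executor-cores 4 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC" \
  ...
```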





--
Regards,
Ramandeep Singh

Re: Troubleshooting Spark OOM

William Shen
Thank you for the tips. We are running Spark 1.6 (Scala), and the OOM happens when Spark SQL tries to join a few large datasets together for processing/transformation...


Re: Troubleshooting Spark OOM

Dillon Dukek
I think most Spark technical support people would really recommend upgrading to Spark 2.0+ for starters. However, I understand that's not always possible. In this case I would double-check that you don't have a situation where a join key has many records associated with it in one or both datasets. That would cause all of those records to get pushed into a single partition, which can make your process OOM when it goes to process that partition in the next stage.
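One quick way to check for such a hot key before the join (the table and column names here are placeholders for your own):

```shell
# Count rows per join key in each input; a key with a very large count
# will land in one partition after the shuffle and can OOM that task.
spark-sql -e "
  SELECT join_key, COUNT(*) AS cnt
  FROM my_large_table
  GROUP BY join_key
  ORDER BY cnt DESC
  LIMIT 20"
```

Run the same check against each side of the join; if one key dominates, common workarounds are filtering it out for separate handling or salting the key.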
