OOM Error

OOM Error

Ankit Khettry
I have a Spark job that consists of a large number of window operations and hence involves large shuffles. I have roughly 900 GiB of data, and I am using what should be a large enough cluster (10 m5.4xlarge instances). I am using the following configurations for the job, and have tried various other combinations without success.

spark.yarn.driver.memoryOverhead 6g
spark.storage.memoryFraction 0.1
spark.executor.cores 6
spark.executor.memory 36g
spark.memory.offHeap.size 8g
spark.memory.offHeap.enabled true
spark.executor.instances 10
spark.driver.memory 14g
spark.yarn.executor.memoryOverhead 10g
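
For reference, a minimal sketch of how these settings are applied when the session is built (assuming a Scala job; on Databricks/YARN the executor and driver sizing would normally be supplied in the cluster or spark-submit configuration before the application starts):

import org.apache.spark.sql.SparkSession

// Sketch only: values mirror the configuration listed above; executor and
// driver sizing normally has to be set at launch time, not after startup.
val spark = SparkSession.builder()
  .appName("window-heavy-job")
  .config("spark.executor.instances", "10")
  .config("spark.executor.cores", "6")
  .config("spark.executor.memory", "36g")
  .config("spark.yarn.executor.memoryOverhead", "10g")
  .config("spark.driver.memory", "14g")
  .config("spark.yarn.driver.memoryOverhead", "6g")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "8g")
  .getOrCreate()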

I keep running into the following OOM error:

org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 16384 bytes of memory, got 0
at org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157)
at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:98)
at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.<init>(UnsafeInMemorySorter.java:128)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:163)

I see there are a number of JIRAs filed for similar issues, and many of them are marked resolved.
Can someone guide me on how to approach this problem? I am using Databricks Spark 2.4.1.

Best Regards
Ankit Khettry

Re: OOM Error

Upasana Sharma
Is it a streaming job?

Re: OOM Error

Ankit Khettry
Nope, it's a batch job. 

Best Regards
Ankit Khettry 

Re: OOM Error

Chris Teoh
Hi Ankit,

Without looking at the Spark UI and the stages/DAG, I'm guessing you're running with the default number of Spark shuffle partitions.

If you're seeing a lot of shuffle spill, you likely have to increase the number of shuffle partitions to accommodate the huge shuffle size.
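
For example, a minimal sketch of that change (the exact value here is a guess and needs tuning against the shuffle sizes shown in the UI):

// Raise shuffle partitions well above the default of 200 so each shuffle
// partition is small enough to sort without exhausting execution memory.
spark.conf.set("spark.sql.shuffle.partitions", "2000")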

I hope that helps
Chris

Re: OOM Error

Ankit Khettry
Thanks Chris

Going to try it soon, perhaps by setting spark.sql.shuffle.partitions to 2001. Also, I was wondering: would it help if I repartitioned the data by the fields I am using in the group-by and window operations?
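
For concreteness, a sketch of the repartitioning idea (df is the input DataFrame; the column names user_id, event_time and amount are hypothetical stand-ins for the real keys):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Repartition by the same key the window is partitioned by, so each task
// only has to sort the rows for its own slice of keys.
val repartitioned = df.repartition(2001, col("user_id"))
val w = Window.partitionBy("user_id").orderBy("event_time")
val result = repartitioned.withColumn("running_sum", sum(col("amount")).over(w))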

Best Regards 
Ankit Khettry 

Re: OOM Error

Chris Teoh
You can try that. Consider processing each partition separately if your data is heavily skewed when you partition it.
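
One way to act on this, as a rough sketch (the skewed column user_id, the hot key value, and the runWindows helper are all hypothetical): pull the heaviest keys out, run the same window logic on them separately, and union the results back.

import org.apache.spark.sql.functions._

// Keys found to dominate the data, e.g. from df.groupBy("user_id").count()
val hotKeys = Seq("some_very_frequent_key")

val hot  = df.filter(col("user_id").isin(hotKeys: _*))
val rest = df.filter(!col("user_id").isin(hotKeys: _*))

// runWindows applies the existing window/group-by logic; the hot slice can be
// given more partitions (or broken down further) than the rest.
val result = runWindows(hot).union(runWindows(rest))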

Re: OOM Error

Ankit Khettry
Still unable to overcome the error. Attaching some screenshots for reference.
Following are the configs used:
spark.yarn.max.executor.failures 1000
spark.yarn.driver.memoryOverhead 6g
spark.executor.cores 6
spark.executor.memory 36g
spark.sql.shuffle.partitions 2001
spark.memory.offHeap.size 8g
spark.memory.offHeap.enabled true
spark.executor.instances 10
spark.driver.memory 14g
spark.yarn.executor.memoryOverhead 10g

Best Regards
Ankit Khettry

Attachments: Screenshot 2019-09-07 at 3.24.01 PM.png (142K), Screenshot 2019-09-07 at 3.24.48 PM.png (543K)

Re: OOM Error

Chris Teoh
It says you have 3811 tasks in the earlier stages and you're going down to 2001 partitions, which makes each shuffle partition larger and more memory intensive. I'm guessing the original failing run used the default of 200 shuffle partitions, which would have failed for the same reason. Go for a higher number, maybe even higher than 3811. What were the shuffle write from stage 7 and the shuffle read from stage 8?
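
As a rough rule of thumb (a back-of-the-envelope sketch, not taken from the UI numbers here), size the partition count from the total shuffle write so each shuffle partition lands around 100-200 MB:

// Sketch, assuming roughly 900 GiB of shuffle write and a ~128 MiB target
// per shuffle partition:
val shuffleWriteBytes = 900L * 1024 * 1024 * 1024
val targetBytes       = 128L * 1024 * 1024
val partitions = math.ceil(shuffleWriteBytes.toDouble / targetBytes).toInt  // = 7200
spark.conf.set("spark.sql.shuffle.partitions", partitions.toString)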

Re: OOM Error

Sunil kalra
Ankit

Can you try reducing the number of cores or increasing memory? With the configuration below, each core is getting only about 3.5 GB. Otherwise, your data may be skewed such that one core is getting too much data for a given key.

spark.executor.cores 6
spark.executor.memory 36g
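
For what it's worth, the ~3.5 GB figure follows from Spark's unified memory model (a sketch assuming the default spark.memory.fraction of 0.6 and the fixed 300 MB of reserved memory):

// Approximate unified (execution + storage) memory available per concurrent task:
val executorHeapMiB = 36L * 1024   // spark.executor.memory = 36g
val reservedMiB     = 300L         // reserved system memory
val memoryFraction  = 0.6          // default spark.memory.fraction
val cores           = 6            // spark.executor.cores
val perCoreMiB = (executorHeapMiB - reservedMiB) * memoryFraction / cores
// ≈ 3656 MiB, i.e. roughly 3.5 GiB per task; fewer cores or more memory raises it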

Re: OOM Error

Ankit Khettry
Sure folks, will try later today!

Best Regards
Ankit Khettry
