Thread spilling sort issue with single task

kumar.rajat20del
Hi Everyone,

I am running a Spark application in which I have applied two left joins. The first is a broadcast join and the second is a regular (non-broadcast) join.
Out of 200 tasks, the last one is stuck. It is running at the "ANY" locality level. It looks like a data skew issue.
The task spills heavily and its shuffle write is very large. The following message appears in the executor logs:

INFO UnsafeExternalSorter: Thread spilling sort data of 10.4 GB to disk (10  times so far)


Can anyone suggest what might be wrong?

Thanks
Rajat
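
One way to check the skew suspicion is to count rows per join key and see how they land on the 200 shuffle partitions. The following is a minimal, Spark-free sketch: the key names and row counts are made up, and CRC32 is only a deterministic stand-in for Spark's own hash partitioner (which uses Murmur3). In Spark itself, a `groupBy` on the join key followed by `count()` would give the per-key numbers.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 200  # matches the 200 tasks above (Spark's default shuffle partition count)


def partition_of(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in for Spark's hash partitioner; CRC32 is used only for determinism.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def rows_per_partition(keys, num_partitions: int = NUM_PARTITIONS) -> Counter:
    # Count how many rows each shuffle partition would receive.
    return Counter(partition_of(k, num_partitions) for k in keys)


# Hypothetical data: one hot key plus 10,000 keys with a single row each.
keys = ["hot_customer"] * 100_000 + [f"customer_{i}" for i in range(10_000)]

key_counts = Counter(keys)
part_counts = rows_per_partition(keys)

print("largest keys:", key_counts.most_common(3))
print("largest partitions:", part_counts.most_common(3))
```

If a single key accounts for most of the rows, the partition that receives it is the one whose task keeps spilling while the other 199 finish quickly.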

Re: Thread spilling sort issue with single task

German Schiavon Matteo
Hi, 

One word: SKEW

This looks like the classic skew problem. You would have to apply skew-handling techniques to repartition your data more evenly or, if you are on Spark 3.0+, try the skew join optimization.
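
For the Spark 3.0+ route, the skew join optimization is part of adaptive query execution and is controlled by two settings. The sketch below only renders them as `spark-submit` arguments; the helper name `to_submit_flags` is hypothetical, and whether the optimization actually kicks in for a given join depends on the Spark version and the partition sizes.

```python
# Settings for adaptive query execution and its skew join handling (Spark 3.0+ only).
AQE_SKEW_JOIN_CONF = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def to_submit_flags(conf):
    """Render a config dict as spark-submit `--conf key=value` arguments."""
    flags = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


flags = to_submit_flags(AQE_SKEW_JOIN_CONF)
print(" ".join(flags))
```

These settings have no effect on Spark 2.4, where the skew has to be handled by hand.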

In reply to rajat kumar's message of Tue, 26 Jan 2021 at 11:20.

Re: Thread spilling sort issue with single task

kumar.rajat20del
Hi,

Yes, I understand it is a skew-related problem, but how can it be avoided? Could you please suggest an approach?

I am on Spark 2.4.

Thanks
Rajat

In reply to German Schiavon's message of Tue, Jan 26, 2021 at 3:58 PM.

Re: Thread spilling sort issue with single task

German Schiavon Matteo
Well, if your data is skewed, I don't think it can be avoided, only mitigated using skew-handling techniques.

I'd recommend taking a look at the "salted join" technique.
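
To make the idea concrete, here is a minimal, Spark-free sketch of a salted left join. The skewed side gets a random salt per row, the other side is replicated once per salt value, and the join runs on (key, salt), so a hot key is split across several groups instead of one. All names, the data, and the bucket count of 8 are assumptions for illustration; on Spark 2.4 the same steps would be written with DataFrame operations (a random salt column on the skewed side, an exploded salt column on the other).

```python
import random
from collections import Counter, defaultdict

SALT_BUCKETS = 8  # hypothetical; tune to the degree of skew


def salt_left(rows, buckets, rng):
    """Attach a random salt to every row of the large, skewed side."""
    return [(key, rng.randrange(buckets), value) for key, value in rows]


def explode_right(rows, buckets):
    """Replicate every row of the other side once per salt value."""
    return [(key, salt, value) for key, value in rows for salt in range(buckets)]


def left_join(left, right):
    """Hash left join on (key, salt); unmatched left rows get None."""
    index = defaultdict(list)
    for key, salt, value in right:
        index[(key, salt)].append(value)
    out = []
    for key, salt, value in left:
        matches = index.get((key, salt)) or [None]
        out.extend((key, value, m) for m in matches)
    return out


# Hypothetical data: "hot" dominates the left side, "orphan" has no match.
left_rows = [("hot", i) for i in range(10_000)] + [("rare", -1), ("orphan", -2)]
right_rows = [("hot", "H"), ("rare", "R")]

rng = random.Random(42)
salted_left = salt_left(left_rows, SALT_BUCKETS, rng)
salted_result = left_join(salted_left, explode_right(right_rows, SALT_BUCKETS))

# Reference: the same join without salting (every row gets salt 0).
plain_result = left_join(
    [(k, 0, v) for k, v in left_rows],
    [(k, 0, v) for k, v in right_rows],
)

# Size of each (key, salt) group, i.e. the work a single task would receive.
group_sizes = Counter((key, salt) for key, salt, _ in salted_left)
print("same result as plain join:", Counter(salted_result) == Counter(plain_result))
print("largest salted group:", max(group_sizes.values()), "of", len(left_rows), "rows")
```

The trade-off is that the non-skewed side grows by a factor equal to the number of salt buckets, so the bucket count should stay as small as the skew allows.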



In reply to rajat kumar's message of Tue, 26 Jan 2021 at 11:29.